diff --git a/.nojekyll b/.nojekyll
new file mode 100644
index 00000000..e69de29b
diff --git a/cache.json b/cache.json
new file mode 100644
index 00000000..b599ee65
--- /dev/null
+++ b/cache.json
@@ -0,0 +1 @@
+{"2024-12-24T00:00:00Z":{"Computation and Language":[{"id":"http://arxiv.org/abs/2412.18603v1","updated":"2024-12-24T18:56:46Z","published":"2024-12-24T18:56:46Z","title":"Long-Form Speech Generation with Spoken Language Models","summary":" We consider the generative modeling of speech over multiple minutes, a\nrequirement for long-form multimedia generation and audio-native voice\nassistants. However, current spoken language models struggle to generate\nplausible speech past tens of seconds, from high temporal resolution of speech\ntokens causing loss of coherence, to architectural issues with long-sequence\ntraining or extrapolation, to memory costs at inference time. With these\nconsiderations we propose SpeechSSM, the first speech language model to learn\nfrom and sample long-form spoken audio (e.g., 16 minutes of read or\nextemporaneous speech) in a single decoding session without text intermediates,\nbased on recent advances in linear-time sequence modeling. Furthermore, to\naddress growing challenges in spoken language evaluation, especially in this\nnew long-form setting, we propose: new embedding-based and LLM-judged metrics;\nquality measurements over length and time; and a new benchmark for long-form\nspeech processing and generation, LibriSpeech-Long. Speech samples and the\ndataset are released at\nhttps://google.github.io/tacotron/publications/speechssm/\n","authors":["Se Jin Park","Julian Salazar","Aren Jansen","Keisuke Kinoshita","Yong Man Ro","RJ Skerry-Ryan"],"pdf_url":"https://arxiv.org/pdf/2412.18603v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.18582v1","updated":"2024-12-24T18:18:52Z","published":"2024-12-24T18:18:52Z","title":"Exploring Embedding Priors in Prompt-Tuning for Improved\n Interpretability and Control","summary":" Prompt-Tuning is an efficient method for adapting pre-trained language models\nto new tasks with minimal computational overhead by modifying prompt\nembeddings. 
In this work, we investigate how crucial the phenomenon of\nembedding collapse, frequently observed in Prompt-Tuning, is for the final\nperformance of the model. To address this question, we designed embedding\npriors and compared them with posteriors of the converged Soft and Deep\nPrompt-Tuning methods. Our findings suggest that priors strongly affect the\nposition of the tuned embeddings, and models can effectively work with\nembeddings from different parts of activation spaces, including completely new\nregions. As the final Prompt-Tuning capabilities are limited, we hypothesize\nthat controllable Prompt-Tuning posteriors may serve as a good starting point\nfor tasks such as chain-of-thought (COT) distillation. Our experiments also\nshow that generated trajectories are not localized in the activation space of\nthe models. However, there are distinct clusters of activations for distant\ntasks (e.g., NLP and arithmetic), while activations between NLP tasks (e.g.,\nQuestion-Answering and MLM) lie in the same cluster. 
These observations raise\nquestions about the importance of a single activation cluster for the\ngeneralization abilities of large language models.\n","authors":["Sergey Sedov","Sumanth Bharadwaj Hachalli Karanam","Venu Gopal Kadamba"],"pdf_url":"https://arxiv.org/pdf/2412.18582v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.10924v3","updated":"2024-12-24T17:56:50Z","published":"2024-12-14T18:18:52Z","title":"Tokens, the oft-overlooked appetizer: Large language models, the\n distributional hypothesis, and meaning","summary":" Tokenization is a necessary component within the current architecture of many\nlanguage models, including the transformer-based large language models (LLMs)\nof Generative AI, yet its impact on the model's cognition is often overlooked.\nWe argue that LLMs demonstrate that the Distributional Hypothesis (DH) is\nsufficient for reasonably human-like language performance, and that the\nemergence of human-meaningful linguistic units among tokens motivates\nlinguistically-informed interventions in existing, linguistically-agnostic\ntokenization techniques, particularly with respect to their roles as (1)\nsemantic primitives and as (2) vehicles for conveying salient distributional\npatterns from human language to the model. We explore tokenizations from a BPE\ntokenizer; extant model vocabularies obtained from Hugging Face and tiktoken;\nand the information in exemplar token vectors as they move through the layers\nof a RoBERTa (large) model. Besides creating sub-optimal semantic building\nblocks and obscuring the model's access to the necessary distributional\npatterns, we describe how tokenization pretraining can be a backdoor for bias\nand other unwanted content, which current alignment practices may not\nremediate. 
Additionally, we relay evidence that the tokenization algorithm's\nobjective function impacts the LLM's cognition, despite being meaningfully\ninsulated from the main system intelligence.\n","authors":["Julia Witte Zimmerman","Denis Hudon","Kathryn Cramer","Alejandro J. Ruiz","Calla Beauregard","Ashley Fehr","Mikaela Irene Fudolig","Bradford Demarest","Yoshi Meke Bird","Milo Z. Trujillo","Christopher M. Danforth","Peter Sheridan Dodds"],"pdf_url":"https://arxiv.org/pdf/2412.10924v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.18573v1","updated":"2024-12-24T17:56:08Z","published":"2024-12-24T17:56:08Z","title":"How Well Do LLMs Generate Code for Different Application Domains?\n Benchmark and Evaluation","summary":" Recently, an increasing number of AI-driven programming assistants powered by\ncode LLMs have been integrated into various real-world software development\nenvironments, significantly boosting developer productivity. However, existing\ncode generation benchmarks primarily focus on general-purpose scenarios,\nleaving the code generation performance of LLMs for specific application\ndomains largely unknown. In this paper, we introduce a new benchmark,\nMultiCodeBench, to fill this gap. MultiCodeBench comprises 2,400 programming\ntasks, covering 12 popular software development domains and 15 programming\nlanguages. Specifically, we perform in-depth research to identify these 12\napplication domains. Given that each domain may involve multiple technical\nframeworks, and that different frameworks present distinct challenges in the\ncoding process, we categorize the commonly used frameworks and platforms within\neach domain. We then sample programming problems from GitHub repositories\nrelated to these subdomains. To ensure the quality of the tasks and mitigate\ndata leakage issues, we invite annotators to rewrite the docstrings for each\ntask in MultiCodeBench. 
Additionally, we build a static analysis-based\ndependency parsing tool to extract the dependencies in the ground truth for\neach task, enabling deeper performance analysis. Through extensive experiments\non MultiCodeBench with eleven representative mainstream LLMs, we reveal the\ncode generation performance of the LLMs across different application domains,\nproviding practical insights for developers in downstream fields when selecting\nLLMs. Furthermore, we analyze the reasons behind the models' failures in\ncompleting software application development tasks, offering guidance for model\ndevelopers to enhance domain-specific code generation capabilities.\n","authors":["Dewu Zheng","Yanlin Wang","Ensheng Shi","Hongyu Zhang","Zibin Zheng"],"pdf_url":"https://arxiv.org/pdf/2412.18573v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.16135v2","updated":"2024-12-24T17:50:01Z","published":"2024-12-20T18:31:24Z","title":"Can LLMs Obfuscate Code? A Systematic Analysis of Large Language Models\n into Assembly Code Obfuscation","summary":" Malware authors often employ code obfuscations to make their malware harder\nto detect. Existing tools for generating obfuscated code often require access\nto the original source code (e.g., C++ or Java), and adding new obfuscations is\na non-trivial, labor-intensive process. In this study, we ask the following\nquestion: Can Large Language Models (LLMs) potentially generate a new\nobfuscated assembly code? If so, this poses a risk to anti-virus engines and\npotentially increases the flexibility of attackers to create new obfuscation\npatterns. We answer this in the affirmative by developing the MetamorphASM\nbenchmark comprising MetamorphASM Dataset (MAD) along with three code\nobfuscation techniques: dead code, register substitution, and control flow\nchange. The MetamorphASM systematically evaluates the ability of LLMs to\ngenerate and analyze obfuscated code using MAD, which contains 328,200\nobfuscated assembly code samples. 
We release this dataset and analyze the\nsuccess rate of various LLMs (e.g., GPT-3.5/4, GPT-4o-mini, Starcoder,\nCodeGemma, CodeLlama, CodeT5, and LLaMA 3.1) in generating obfuscated assembly\ncode. The evaluation was performed using established information-theoretic\nmetrics and manual human review to ensure correctness and provide the\nfoundation for researchers to study and develop remediations to this risk. The\nsource code can be found at the following GitHub link:\nhttps://github.com/mohammadi-ali/MetamorphASM.\n","authors":["Seyedreza Mohseni","Seyedali Mohammadi","Deepa Tilwani","Yash Saxena","Gerald Ndawula","Sriram Vema","Edward Raff","Manas Gaur"],"pdf_url":"https://arxiv.org/pdf/2412.16135v2.pdf","comment":"To appear in AAAI 2025, Main Track"},{"id":"http://arxiv.org/abs/2412.18566v1","updated":"2024-12-24T17:37:11Z","published":"2024-12-24T17:37:11Z","title":"Zero-resource Speech Translation and Recognition with LLMs","summary":" Despite recent advancements in speech processing, zero-resource speech\ntranslation (ST) and automatic speech recognition (ASR) remain challenging\nproblems. In this work, we propose to leverage a multilingual Large Language\nModel (LLM) to perform ST and ASR in languages for which the model has never\nseen paired audio-text data. We achieve this by using a pre-trained\nmultilingual speech encoder, a multilingual LLM, and a lightweight adaptation\nmodule that maps the audio representations to the token embedding space of the\nLLM. We perform several experiments both in ST and ASR to understand how to\nbest train the model and what data has the most impact on performance in\npreviously unseen languages. In ST, our best model is capable to achieve BLEU\nscores over 23 in CoVoST2 for two previously unseen languages, while in ASR, we\nachieve WERs of up to 28.2\\%. 
We finally show that the performance of our\nsystem is bounded by the ability of the LLM to output text in the desired\nlanguage.\n","authors":["Karel Mundnich","Xing Niu","Prashant Mathur","Srikanth Ronanki","Brady Houston","Veera Raghavendra Elluru","Nilaksh Das","Zejiang Hou","Goeric Huybrechts","Anshu Bhatia","Daniel Garcia-Romero","Kyu J. Han","Katrin Kirchhoff"],"pdf_url":"https://arxiv.org/pdf/2412.18566v1.pdf","comment":"ICASSP 2025, 5 pages, 2 figures, 2 tables"},{"id":"http://arxiv.org/abs/2412.18552v1","updated":"2024-12-24T17:05:26Z","published":"2024-12-24T17:05:26Z","title":"Distilling Fine-grained Sentiment Understanding from Large Language\n Models","summary":" Fine-grained sentiment analysis (FSA) aims to extract and summarize user\nopinions from vast opinionated text. Recent studies demonstrate that large\nlanguage models (LLMs) possess exceptional sentiment understanding\ncapabilities. However, directly deploying LLMs for FSA applications incurs high\ninference costs. Therefore, this paper investigates the distillation of\nfine-grained sentiment understanding from LLMs into small language models\n(SLMs). We prompt LLMs to examine and interpret the sentiments of given reviews\nand then utilize the generated content to pretrain SLMs. Additionally, we\ndevelop a comprehensive FSA benchmark to evaluate both SLMs and LLMs. Extensive\nexperiments on this benchmark reveal that: (1) distillation significantly\nenhances the performance of SLMs in FSA tasks, achieving a 6.00\\% improvement\nin $F_1$-score, and the distilled model can outperform Llama-2-7b with only\n220M parameters; (2) distillation equips SLMs with excellent zero-shot\nsentiment classification capabilities, enabling them to match or even exceed\ntheir teacher models. These results suggest that distillation from LLMs is a\nhighly promising direction for FSA. 
We will release our code, data, and\npretrained model weights at\n\\url{https://github.com/HITSZ-HLT/FSA-Distillation}.\n","authors":["Yice Zhang","Guangyu Xie","Hongling Xu","Kaiheng Hou","Jianzhu Bao","Qianlong Wang","Shiwei Chen","Ruifeng Xu"],"pdf_url":"https://arxiv.org/pdf/2412.18552v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.18551v1","updated":"2024-12-24T17:03:44Z","published":"2024-12-24T17:03:44Z","title":"Libra-Leaderboard: Towards Responsible AI through a Balanced Leaderboard\n of Safety and Capability","summary":" To address this gap, we introduce Libra-Leaderboard, a comprehensive\nframework designed to rank LLMs through a balanced evaluation of performance\nand safety. Combining a dynamic leaderboard with an interactive LLM arena,\nLibra-Leaderboard encourages the joint optimization of capability and safety.\nUnlike traditional approaches that average performance and safety metrics,\nLibra-Leaderboard uses a distance-to-optimal-score method to calculate the\noverall rankings. This approach incentivizes models to achieve a balance rather\nthan excelling in one dimension at the expense of some other ones. In the first\nrelease, Libra-Leaderboard evaluates 26 mainstream LLMs from 14 leading\norganizations, identifying critical safety challenges even in state-of-the-art\nmodels.\n","authors":["Haonan Li","Xudong Han","Zenan Zhai","Honglin Mu","Hao Wang","Zhenxuan Zhang","Yilin Geng","Shom Lin","Renxi Wang","Artem Shelmanov","Xiangyu Qi","Yuxia Wang","Donghai Hong","Youliang Yuan","Meng Chen","Haoqin Tu","Fajri Koto","Tatsuki Kuribayashi","Cong Zeng","Rishabh Bhardwaj","Bingchen Zhao","Yawen Duan","Yi Liu","Emad A. 
Alghamdi","Yaodong Yang","Yinpeng Dong","Soujanya Poria","Pengfei Liu","Zhengzhong Liu","Xuguang Ren","Eduard Hovy","Iryna Gurevych","Preslav Nakov","Monojit Choudhury","Timothy Baldwin"],"pdf_url":"https://arxiv.org/pdf/2412.18551v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.18547v1","updated":"2024-12-24T16:55:45Z","published":"2024-12-24T16:55:45Z","title":"Token-Budget-Aware LLM Reasoning","summary":" Reasoning is critical for large language models (LLMs) to excel in a wide\nrange of tasks. While methods like Chain-of-Thought (CoT) reasoning enhance LLM\nperformance by decomposing problems into intermediate steps, they also incur\nsignificant overhead in token usage, leading to increased costs. We find that\nthe reasoning process of current LLMs is unnecessarily lengthy and it can be\ncompressed by including a reasonable token budget in the prompt, but the choice\nof token budget plays a crucial role in the actual compression effectiveness.\nWe then propose a token-budget-aware LLM reasoning framework, which dynamically\nestimates token budgets for different problems based on reasoning complexity\nand uses the estimated token budgets to guide the reasoning process.\nExperiments show that our method effectively reduces token costs in CoT\nreasoning with only a slight performance reduction, offering a practical\nsolution to balance efficiency and accuracy in LLM reasoning. Code:\nhttps://github.com/GeniusHTX/TALE.\n","authors":["Tingxu Han","Chunrong Fang","Shiyu Zhao","Shiqing Ma","Zhenyu Chen","Zhenting Wang"],"pdf_url":"https://arxiv.org/pdf/2412.18547v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.18544v1","updated":"2024-12-24T16:51:35Z","published":"2024-12-24T16:51:35Z","title":"Consistency Checks for Language Model Forecasters","summary":" Forecasting is a task that is difficult to evaluate: the ground truth can\nonly be known in the future. 
Recent work showing LLM forecasters rapidly\napproaching human-level performance begs the question: how can we benchmark and\nevaluate these forecasters instantaneously? Following the consistency check\nframework, we measure the performance of forecasters in terms of the\nconsistency of their predictions on different logically-related questions. We\npropose a new, general consistency metric based on arbitrage: for example, if a\nforecasting AI illogically predicts that both the Democratic and Republican\nparties have 60% probability of winning the 2024 US presidential election, an\narbitrageur can trade against the forecaster's predictions and make a profit.\nWe build an automated evaluation system that generates a set of base questions,\ninstantiates consistency checks from these questions, elicits the predictions\nof the forecaster, and measures the consistency of the predictions. We then\nbuild a standard, proper-scoring-rule forecasting benchmark, and show that our\n(instantaneous) consistency metrics correlate with LLM forecasters' ground\ntruth Brier scores (which are only known in the future). We also release a\nconsistency benchmark that resolves in 2028, providing a long-term evaluation\ntool for forecasting.\n","authors":["Daniel Paleka","Abhimanyu Pallavi Sudhir","Alejandro Alvarez","Vineeth Bhat","Adam Shen","Evan Wang","Florian Tramèr"],"pdf_url":"https://arxiv.org/pdf/2412.18544v1.pdf","comment":"56 pages, 25 figures. Submitted to ICLR 2025"},{"id":"http://arxiv.org/abs/2412.12564v2","updated":"2024-12-24T16:41:40Z","published":"2024-12-17T05:48:48Z","title":"Evaluating Zero-Shot Multilingual Aspect-Based Sentiment Analysis with\n Large Language Models","summary":" Aspect-based sentiment analysis (ABSA), a sequence labeling task, has\nattracted increasing attention in multilingual contexts. 
While previous\nresearch has focused largely on fine-tuning or training models specifically for\nABSA, we evaluate large language models (LLMs) under zero-shot conditions to\nexplore their potential to tackle this challenge with minimal task-specific\nadaptation. We conduct a comprehensive empirical evaluation of a series of LLMs\non multilingual ABSA tasks, investigating various prompting strategies,\nincluding vanilla zero-shot, chain-of-thought (CoT), self-improvement,\nself-debate, and self-consistency, across nine different models. Results\nindicate that while LLMs show promise in handling multilingual ABSA, they\ngenerally fall short of fine-tuned, task-specific models. Notably, simpler\nzero-shot prompts often outperform more complex strategies, especially in\nhigh-resource languages like English. These findings underscore the need for\nfurther refinement of LLM-based approaches to effectively address ABSA task\nacross diverse languages.\n","authors":["Chengyan Wu","Bolei Ma","Zheyu Zhang","Ningyuan Deng","Yanqing He","Yun Xue"],"pdf_url":"https://arxiv.org/pdf/2412.12564v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.18537v1","updated":"2024-12-24T16:38:04Z","published":"2024-12-24T16:38:04Z","title":"Harnessing Large Language Models for Knowledge Graph Question Answering\n via Adaptive Multi-Aspect Retrieval-Augmentation","summary":" Large Language Models (LLMs) demonstrate remarkable capabilities, yet\nstruggle with hallucination and outdated knowledge when tasked with complex\nknowledge reasoning, resulting in factually incorrect outputs. Previous studies\nhave attempted to mitigate it by retrieving factual knowledge from large-scale\nknowledge graphs (KGs) to assist LLMs in logical reasoning and prediction of\nanswers. However, this kind of approach often introduces noise and irrelevant\ndata, especially in situations with extensive context from multiple knowledge\naspects. 
In this way, LLM attention can be potentially mislead from question\nand relevant information. In our study, we introduce an Adaptive Multi-Aspect\nRetrieval-augmented over KGs (Amar) framework. This method retrieves knowledge\nincluding entities, relations, and subgraphs, and converts each piece of\nretrieved text into prompt embeddings. The Amar framework comprises two key\nsub-components: 1) a self-alignment module that aligns commonalities among\nentities, relations, and subgraphs to enhance retrieved text, thereby reducing\nnoise interference; 2) a relevance gating module that employs a soft gate to\nlearn the relevance score between question and multi-aspect retrieved data, to\ndetermine which information should be used to enhance LLMs' output, or even\nfiltered altogether. Our method has achieved state-of-the-art performance on\ntwo common datasets, WebQSP and CWQ, showing a 1.9\\% improvement in accuracy\nover its best competitor and a 6.6\\% improvement in logical form generation\nover a method that directly uses retrieved text as context prompts. These\nresults demonstrate the effectiveness of Amar in improving the reasoning of\nLLMs.\n","authors":["Derong Xu Xinhang Li","Ziheng Zhang","Zhenxi Lin","Zhihong Zhu","Zhi Zheng","Xian Wu","Xiangyu Zhao","Tong Xu","Enhong Chen"],"pdf_url":"https://arxiv.org/pdf/2412.18537v1.pdf","comment":"Accepted by AAAI'2025"},{"id":"http://arxiv.org/abs/2408.14909v2","updated":"2024-12-24T16:25:27Z","published":"2024-08-27T09:35:49Z","title":"SpikingSSMs: Learning Long Sequences with Sparse and Parallel Spiking\n State Space Models","summary":" Known as low energy consumption networks, spiking neural networks (SNNs) have\ngained a lot of attention within the past decades. 
While SNNs are increasing\ncompetitive with artificial neural networks (ANNs) for vision tasks, they are\nrarely used for long sequence tasks, despite their intrinsic temporal dynamics.\nIn this work, we develop spiking state space models (SpikingSSMs) for long\nsequence learning by leveraging on the sequence learning abilities of state\nspace models (SSMs). Inspired by dendritic neuron structure, we hierarchically\nintegrate neuronal dynamics with the original SSM block, meanwhile realizing\nsparse synaptic computation. Furthermore, to solve the conflict of event-driven\nneuronal dynamics with parallel computing, we propose a light-weight surrogate\ndynamic network which accurately predicts the after-reset membrane potential\nand compatible to learnable thresholds, enabling orders of acceleration in\ntraining speed compared with conventional iterative methods. On the long range\narena benchmark task, SpikingSSM achieves competitive performance to\nstate-of-the-art SSMs meanwhile realizing on average 90\\% of network sparsity.\nOn language modeling, our network significantly surpasses existing spiking\nlarge language models (spikingLLMs) on the WikiText-103 dataset with only a\nthird of the model size, demonstrating its potential as backbone architecture\nfor low computation cost LLMs.\n","authors":["Shuaijie Shen","Chao Wang","Renzhuo Huang","Yan Zhong","Qinghai Guo","Zhichao Lu","Jianguo Zhang","Luziwei Leng"],"pdf_url":"https://arxiv.org/pdf/2408.14909v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.18530v1","updated":"2024-12-24T16:24:43Z","published":"2024-12-24T16:24:43Z","title":"Characterizations of Language Generation With Breadth","summary":" We study language generation in the limit, introduced by Kleinberg and\nMullainathan [KM24], building on classical works of Gold [Gol67] and Angluin\n[Ang79]. [KM24] proposed an algorithm that generates strings from any countable\nlanguage collection in the limit. 
While their algorithm eventually outputs\nstrings from the target language $K$, it sacrifices breadth, i.e., the ability\nto generate all strings in $K$. A key open question in [KM24] is whether this\ntrade-off between consistency and breadth is inherent.\n Recent works proposed different notions of consistent generation with\nbreadth. Kalavasis, Mehrotra, and Velegkas [KVM24] introduced three\ndefinitions: generation with exact breadth, approximate breadth, and\nunambiguous generation. Concurrently and independently, Charikar and Pabbaraju\n[CP24a] proposed exhaustive generation. Both works examined when generation\nwith these notions of breadth is possible.\n Building on [CP24a, KVM24], we fully characterize language generation for\nthese notions and their natural combinations. For exact breadth, we provide an\nunconditional lower bound, removing a technical condition from [KVM24] and\nextending the result of [CP24a] that holds for specific collections of\nlanguages. We show that generation with exact breadth is characterized by\nAngluin's condition for identification. We further introduce a weaker version\nof Angluin's condition that tightly characterizes both approximate breadth and\nexhaustive generation, proving their equivalence. Additionally, we show that\nunambiguous generation is also characterized by Angluin's condition as a\nspecial case of a broader result. 
Finally, we strengthen [KVM24] by giving\nunconditional lower bounds for stable generators, showing that Angluin's\ncondition characterizes the previous breadth notions for stable generators.\nThis shows a separation between stable and unstable generation with approximate\nbreadth.\n","authors":["Alkis Kalavasis","Anay Mehrotra","Grigoris Velegkas"],"pdf_url":"https://arxiv.org/pdf/2412.18530v1.pdf","comment":"Abstract shortened to fix arXiv limit"},{"id":"http://arxiv.org/abs/2412.17743v2","updated":"2024-12-24T16:07:47Z","published":"2024-12-23T17:47:53Z","title":"YuLan-Mini: An Open Data-efficient Language Model","summary":" Effective pre-training of large language models (LLMs) has been challenging\ndue to the immense resource demands and the complexity of the technical\nprocesses involved. This paper presents a detailed technical report on\nYuLan-Mini, a highly capable base model with 2.42B parameters that achieves\ntop-tier performance among models of similar parameter scale. Our pre-training\napproach focuses on enhancing training efficacy through three key technical\ncontributions: an elaborate data pipeline combines data cleaning with data\nschedule strategies, a robust optimization method to mitigate training\ninstability, and an effective annealing approach that incorporates targeted\ndata selection and long context training. Remarkably, YuLan-Mini, trained on\n1.08T tokens, achieves performance comparable to industry-leading models that\nrequire significantly more data. To facilitate reproduction, we release the\nfull details of the data composition for each training phase. 
Project details\ncan be accessed at the following link: https://github.com/RUC-GSAI/YuLan-Mini.\n","authors":["Yiwen Hu","Huatong Song","Jia Deng","Jiapeng Wang","Jie Chen","Kun Zhou","Yutao Zhu","Jinhao Jiang","Zican Dong","Wayne Xin Zhao","Ji-Rong Wen"],"pdf_url":"https://arxiv.org/pdf/2412.17743v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.18497v1","updated":"2024-12-24T15:28:56Z","published":"2024-12-24T15:28:56Z","title":"Think or Remember? Detecting and Directing LLMs Towards Memorization or\n Generalization","summary":" In this paper, we explore the foundational mechanisms of memorization and\ngeneralization in Large Language Models (LLMs), inspired by the functional\nspecialization observed in the human brain. Our investigation serves as a case\nstudy leveraging specially designed datasets and experimental-scale LLMs to lay\nthe groundwork for understanding these behaviors. Specifically, we aim to first\nenable LLMs to exhibit both memorization and generalization by training with\nthe designed dataset, then (a) examine whether LLMs exhibit neuron-level\nspatial differentiation for memorization and generalization, (b) predict these\nbehaviors using model internal representations, and (c) steer the behaviors\nthrough inference-time interventions. 
Our findings reveal that neuron-wise\ndifferentiation of memorization and generalization is observable in LLMs, and\ntargeted interventions can successfully direct their behavior.\n","authors":["Yi-Fu Fu","Yu-Chieh Tu","Tzu-Ling Cheng","Cheng-Yu Lin","Yi-Ting Yang","Heng-Yi Liu","Keng-Te Liao","Da-Cheng Juan","Shou-De Lin"],"pdf_url":"https://arxiv.org/pdf/2412.18497v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.18496v1","updated":"2024-12-24T15:28:41Z","published":"2024-12-24T15:28:41Z","title":"Generating event descriptions under syntactic and semantic constraints","summary":" With the goal of supporting scalable lexical semantic annotation, analysis,\nand theorizing, we conduct a comprehensive evaluation of different methods for\ngenerating event descriptions under both syntactic constraints -- e.g. desired\nclause structure -- and semantic constraints -- e.g. desired verb sense. We\ncompare three different methods -- (i) manual generation by experts; (ii)\nsampling from a corpus annotated for syntactic and semantic information; and\n(iii) sampling from a language model (LM) conditioned on syntactic and semantic\ninformation -- along three dimensions of the generated event descriptions: (a)\nnaturalness, (b) typicality, and (c) distinctiveness. We find that all methods\nreliably produce natural, typical, and distinctive event descriptions, but that\nmanual generation continues to produce event descriptions that are more\nnatural, typical, and distinctive than the automated generation methods. 
We\nconclude that the automated methods we consider produce event descriptions of\nsufficient quality for use in downstream annotation and analysis insofar as the\nmethods used for this annotation and analysis are robust to a small amount of\ndegradation in the resulting event descriptions.\n","authors":["Angela Cao","Faye Holt","Jonas Chan","Stephanie Richter","Lelia Glass","Aaron Steven White"],"pdf_url":"https://arxiv.org/pdf/2412.18496v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.18495v1","updated":"2024-12-24T15:26:31Z","published":"2024-12-24T15:26:31Z","title":"How \"Real\" is Your Real-Time Simultaneous Speech-to-Text Translation\n System?","summary":" Simultaneous speech-to-text translation (SimulST) translates source-language\nspeech into target-language text concurrently with the speaker's speech,\nensuring low latency for better user comprehension. Despite its intended\napplication to unbounded speech, most research has focused on human\npre-segmented speech, simplifying the task and overlooking significant\nchallenges. This narrow focus, coupled with widespread terminological\ninconsistencies, is limiting the applicability of research outcomes to\nreal-world applications, ultimately hindering progress in the field. Our\nextensive literature review of 110 papers not only reveals these critical\nissues in current research but also serves as the foundation for our key\ncontributions. 
We 1) define the steps and core components of a SimulST system,\nproposing a standardized terminology and taxonomy; 2) conduct a thorough\nanalysis of community trends, and 3) offer concrete recommendations and future\ndirections to bridge the gaps in existing literature, from evaluation\nframeworks to system architectures, for advancing the field towards more\nrealistic and effective SimulST solutions.\n","authors":["Sara Papi","Peter Polak","Ondřej Bojar","Dominik Macháček"],"pdf_url":"https://arxiv.org/pdf/2412.18495v1.pdf","comment":"Accepted at TACL"},{"id":"http://arxiv.org/abs/2412.18487v1","updated":"2024-12-24T15:18:52Z","published":"2024-12-24T15:18:52Z","title":"Segment-Based Attention Masking for GPTs","summary":" Modern Language Models (LMs) owe much of their success to masked causal\nattention, the backbone of Generative Pre-Trained Transformer (GPT) models.\nAlthough GPTs can process the entire user prompt at once, the causal masking is\napplied to all input tokens step-by-step, mimicking the generation process.\nThis imposes an unnecessary constraint during the initial \"prefill\" phase when\nthe model processes the input prompt and generates the internal representations\nbefore producing any output tokens. In this work, attention is masked based on\nthe known block structure at the prefill phase, followed by the conventional\ntoken-by-token autoregressive process after that. For example, in a typical\nchat prompt, the system prompt is treated as one block, and the user prompt as\nthe next one. Each of these is treated as a unit for the purpose of masking,\nsuch that the first tokens in each block can access the subsequent tokens in a\nnon-causal manner. Then, the model answer is generated in the conventional\ncausal manner. This Segment-by-Segment scheme entails no additional\ncomputational overhead. 
When integrating it into models such as Llama and Qwen,\nstate-of-the-art performance is consistently achieved.\n","authors":["Shahar Katz","Liran Ringel","Yaniv Romano","Lior Wolf"],"pdf_url":"https://arxiv.org/pdf/2412.18487v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2406.04220v4","updated":"2024-12-24T15:08:40Z","published":"2024-06-06T16:18:30Z","title":"BEADs: Bias Evaluation Across Domains","summary":" Recent advancements in large language models (LLMs) have greatly enhanced\nnatural language processing (NLP) applications. Nevertheless, these models\noften inherit biases from their training data. Despite the availability of\nvarious datasets for bias detection, most are limited to one or two NLP tasks\n(typically classification or evaluation) and lack comprehensive evaluations\nacross a broader range of NLP tasks. To address this gap, we introduce the Bias\nEvaluations Across Domains BEADs dataset, designed to support a wide array of\nNLP tasks, including text classification, token classification, bias\nquantification, and benign language generation. A key focus of this paper is\nthe gold label dataset that is annotated by GPT4 for scalabilty and verified by\nexperts to ensure high reliability. BEADs provides data for both fine-tuning,\nincluding classification and language generation tasks, and for evaluating\nLLMs. Our findings indicate that BEADs effectively identifies numerous biases\nwhen fine-tuned on this dataset. It also reduces biases when used for\nfine-tuning language generation task, while preserving language quality. The\nresults also reveal some prevalent demographic biases in LLMs when BEADs is\nused for evaluation in demographic task. We provide the BEADs dataset for\ndetecting biases in various domains, and this dataset is readily usable for\nresponsible AI development and application. The dataset can be accessed at\nhttps://huggingface.co/datasets/shainar/BEAD .\n","authors":["Shaina Raza","Mizanur Rahman","Michael R. 
Zhang"],"pdf_url":"https://arxiv.org/pdf/2406.04220v4.pdf","comment":"under review"},{"id":"http://arxiv.org/abs/2408.11006v3","updated":"2024-12-24T15:04:50Z","published":"2024-08-20T17:00:04Z","title":"Security Attacks on LLM-based Code Completion Tools","summary":" The rapid development of large language models (LLMs) has significantly\nadvanced code completion capabilities, giving rise to a new generation of\nLLM-based Code Completion Tools (LCCTs). Unlike general-purpose LLMs, these\ntools possess unique workflows, integrating multiple information sources as\ninput and prioritizing code suggestions over natural language interaction,\nwhich introduces distinct security challenges. Additionally, LCCTs often rely\non proprietary code datasets for training, raising concerns about the potential\nexposure of sensitive data. This paper exploits these distinct characteristics\nof LCCTs to develop targeted attack methodologies on two critical security\nrisks: jailbreaking and training data extraction attacks. Our experimental\nresults expose significant vulnerabilities within LCCTs, including a 99.4%\nsuccess rate in jailbreaking attacks on GitHub Copilot and a 46.3% success rate\non Amazon Q. Furthermore, We successfully extracted sensitive user data from\nGitHub Copilot, including 54 real email addresses and 314 physical addresses\nassociated with GitHub usernames. Our study also demonstrates that these\ncode-based attack methods are effective against general-purpose LLMs, such as\nthe GPT series, highlighting a broader security misalignment in the handling of\ncode by modern LLMs. These findings underscore critical security challenges\nassociated with LCCTs and suggest essential directions for strengthening their\nsecurity frameworks. 
The example code and attack samples from our research are\nprovided at https://github.com/Sensente/Security-Attacks-on-LCCTs.\n","authors":["Wen Cheng","Ke Sun","Xinyu Zhang","Wei Wang"],"pdf_url":"https://arxiv.org/pdf/2408.11006v3.pdf","comment":"Paper accepted at AAAI 2025"},{"id":"http://arxiv.org/abs/2406.18118v4","updated":"2024-12-24T14:26:36Z","published":"2024-06-26T07:15:44Z","title":"SafeAligner: Safety Alignment against Jailbreak Attacks via Response\n Disparity Guidance","summary":" As the development of large language models (LLMs) rapidly advances, securing\nthese models effectively without compromising their utility has become a\npivotal area of research. However, current defense strategies against jailbreak\nattacks (i.e., efforts to bypass security protocols) often suffer from limited\nadaptability, restricted general capability, and high cost. To address these\nchallenges, we introduce SafeAligner, a methodology implemented at the decoding\nstage to fortify defenses against jailbreak attacks. We begin by developing two\nspecialized models: the Sentinel Model, which is trained to foster safety, and\nthe Intruder Model, designed to generate riskier responses. SafeAligner\nleverages the disparity in security levels between the responses from these\nmodels to differentiate between harmful and beneficial tokens, effectively\nguiding the safety alignment by altering the output token distribution of the\ntarget model. 
Extensive experiments show that SafeAligner can increase the\nlikelihood of beneficial tokens, while reducing the occurrence of harmful ones,\nthereby ensuring secure alignment with minimal loss to generality.\n","authors":["Caishuang Huang","Wanxu Zhao","Rui Zheng","Huijie Lv","Wenyu Zhan","Shihan Dou","Sixian Li","Xiao Wang","Enyu Zhou","Junjie Ye","Yuming Yang","Tao Gui","Qi Zhang","Xuanjing Huang"],"pdf_url":"https://arxiv.org/pdf/2406.18118v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.18443v1","updated":"2024-12-24T14:03:07Z","published":"2024-12-24T14:03:07Z","title":"Is Large Language Model Good at Triple Set Prediction? An Empirical\n Study","summary":" The core of the Knowledge Graph Completion (KGC) task is to predict and\ncomplete the missing relations or nodes in a KG. Common KGC tasks are mostly\nabout inferring unknown elements with one or two elements being known in a\ntriple. In comparison, the Triple Set Prediction (TSP) task is a more realistic\nknowledge graph completion task. It aims to predict all elements of unknown\ntriples based on the information from known triples. In recent years, large\nlanguage models (LLMs) have exhibited significant advancements in language\ncomprehension, demonstrating considerable potential for KGC tasks. However, the\npotential of LLM on the TSP task has not yet to be investigated. Thus in this\npaper we proposed a new framework to explore the strengths and limitations of\nLLM in the TSP task. Specifically, the framework consists of LLM-based rule\nmining and LLM-based triple set prediction. The relation list of KG embedded\nwithin rich semantic information is first leveraged to prompt LLM in the\ngeneration of rules. This process is both efficient and independent of\nstatistical information, making it easier to mine effective and realistic\nrules. 
For each subgraph, the specified rule is applied in conjunction with the\nrelevant triples within that subgraph to guide the LLM in predicting the\nmissing triples. Subsequently, the predictions from all subgraphs are\nconsolidated to derive the complete set of predicted triples on KG. Finally,\nthe method is evaluated on the relatively complete CFamily dataset. The\nexperimental results indicate that when LLMs are required to adhere to a large\namount of factual knowledge to predict missing triples, significant\nhallucinations occurs, leading to a noticeable decline in performance. To\nfurther explore the causes of this phenomenon, this paper presents a\ncomprehensive analysis supported by a detailed case study.\n","authors":["Yuan Yuan","Yajing Xu","Wen Zhang"],"pdf_url":"https://arxiv.org/pdf/2412.18443v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.18440v1","updated":"2024-12-24T13:59:23Z","published":"2024-12-24T13:59:23Z","title":"Unlocking the Potential of Multiple BERT Models for Bangla Question\n Answering in NCTB Textbooks","summary":" Evaluating text comprehension in educational settings is critical for\nunderstanding student performance and improving curricular effectiveness. This\nstudy investigates the capability of state-of-the-art language models-RoBERTa\nBase, Bangla-BERT, and BERT Base-in automatically assessing Bangla\npassage-based question-answering from the National Curriculum and Textbook\nBoard (NCTB) textbooks for classes 6-10. A dataset of approximately 3,000\nBangla passage-based question-answering instances was compiled, and the models\nwere evaluated using F1 Score and Exact Match (EM) metrics across various\nhyperparameter configurations. Our findings revealed that Bangla-BERT\nconsistently outperformed the other models, achieving the highest F1 (0.75) and\nEM (0.53) scores, particularly with smaller batch sizes, the inclusion of stop\nwords, and a moderate learning rate. 
In contrast, RoBERTa Base demonstrated the\nweakest performance, with the lowest F1 (0.19) and EM (0.27) scores under\ncertain configurations. The results underscore the importance of fine-tuning\nhyperparameters for optimizing model performance and highlight the potential of\nmachine learning models in evaluating text comprehension in educational\ncontexts. However, limitations such as dataset size, spelling inconsistencies,\nand computational constraints emphasize the need for further research to\nenhance the robustness and applicability of these models. This study lays the\ngroundwork for the future development of automated evaluation systems in\neducational institutions, providing critical insights into model performance in\nthe context of Bangla text comprehension.\n","authors":["Abdullah Khondoker","Enam Ahmed Taufik","Md Iftekhar Islam Tashik","S M Ishtiak mahmud","Antara Firoz Parsa"],"pdf_url":"https://arxiv.org/pdf/2412.18440v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.18431v1","updated":"2024-12-24T13:45:22Z","published":"2024-12-24T13:45:22Z","title":"GeAR: Graph-enhanced Agent for Retrieval-augmented Generation","summary":" Retrieval-augmented generation systems rely on effective document retrieval\ncapabilities. By design, conventional sparse or dense retrievers face\nchallenges in multi-hop retrieval scenarios. In this paper, we present GeAR,\nwhich advances RAG performance through two key innovations: (i) graph\nexpansion, which enhances any conventional base retriever, such as BM25, and\n(ii) an agent framework that incorporates graph expansion. Our evaluation\ndemonstrates GeAR's superior retrieval performance on three multi-hop question\nanswering datasets. 
Additionally, our system achieves state-of-the-art results\nwith improvements exceeding 10% on the challenging MuSiQue dataset, while\nrequiring fewer tokens and iterations compared to other multi-step retrieval\nsystems.\n","authors":["Zhili Shen","Chenxin Diao","Pavlos Vougiouklis","Pascual Merita","Shriram Piramanayagam","Damien Graux","Dandan Tu","Zeren Jiang","Ruofei Lai","Yang Ren","Jeff Z. Pan"],"pdf_url":"https://arxiv.org/pdf/2412.18431v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.18428v1","updated":"2024-12-24T13:42:44Z","published":"2024-12-24T13:42:44Z","title":"Explainable Multi-Modal Data Exploration in Natural Language via LLM\n Agent","summary":" International enterprises, organizations, or hospitals collect large amounts\nof multi-modal data stored in databases, text documents, images, and videos.\nWhile there has been recent progress in the separate fields of multi-modal data\nexploration as well as in database systems that automatically translate natural\nlanguage questions to database query languages, the research challenge of\nquerying database systems combined with other unstructured modalities such as\nimages in natural language is widely unexplored.\n In this paper, we propose XMODE - a system that enables explainable,\nmulti-modal data exploration in natural language. Our approach is based on the\nfollowing research contributions: (1) Our system is inspired by a real-world\nuse case that enables users to explore multi-modal information systems. (2)\nXMODE leverages a LLM-based agentic AI framework to decompose a natural\nlanguage question into subtasks such as text-to-SQL generation and image\nanalysis. 
(3) Experimental results on multi-modal datasets over relational data\nand images demonstrate that our system outperforms state-of-the-art multi-modal\nexploration systems, excelling not only in accuracy but also in various\nperformance metrics such as query latency, API costs, planning efficiency, and\nexplanation quality, thanks to the more effective utilization of the reasoning\ncapabilities of LLMs.\n","authors":["Farhad Nooralahzadeh","Yi Zhang","Jonathan Furst","Kurt Stockinger"],"pdf_url":"https://arxiv.org/pdf/2412.18428v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.18424v1","updated":"2024-12-24T13:39:32Z","published":"2024-12-24T13:39:32Z","title":"LongDocURL: a Comprehensive Multimodal Long Document Benchmark\n Integrating Understanding, Reasoning, and Locating","summary":" Large vision language models (LVLMs) have improved the document understanding\ncapabilities remarkably, enabling the handling of complex document elements,\nlonger contexts, and a wider range of tasks. However, existing document\nunderstanding benchmarks have been limited to handling only a small number of\npages and fail to provide a comprehensive analysis of layout elements locating.\nIn this paper, we first define three primary task categories: Long Document\nUnderstanding, numerical Reasoning, and cross-element Locating, and then\npropose a comprehensive benchmark, LongDocURL, integrating above three primary\ntasks and comprising 20 sub-tasks categorized based on different primary tasks\nand answer evidences. Furthermore, we develop a semi-automated construction\npipeline and collect 2,325 high-quality question-answering pairs, covering more\nthan 33,000 pages of documents, significantly outperforming existing\nbenchmarks. 
Subsequently, we conduct comprehensive evaluation experiments on\nboth open-source and closed-source models across 26 different configurations,\nrevealing critical performance gaps in this field.\n","authors":["Chao Deng","Jiale Yuan","Pi Bu","Peijie Wang","Zhong-Zhi Li","Jian Xu","Xiao-Hui Li","Yuan Gao","Jun Song","Bo Zheng","Cheng-Lin Liu"],"pdf_url":"https://arxiv.org/pdf/2412.18424v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2402.08309v3","updated":"2024-12-24T13:37:49Z","published":"2024-02-13T09:12:55Z","title":"Prompted Contextual Vectors for Spear-Phishing Detection","summary":" Spear-phishing attacks present a significant security challenge, with large\nlanguage models (LLMs) escalating the threat by generating convincing emails\nand facilitating target reconnaissance. To address this, we propose a detection\napproach based on a novel document vectorization method that utilizes an\nensemble of LLMs to create representation vectors. By prompting LLMs to reason\nand respond to human-crafted questions, we quantify the presence of common\npersuasion principles in the email's content, producing prompted contextual\ndocument vectors for a downstream supervised machine learning model. We\nevaluate our method using a unique dataset generated by a proprietary system\nthat automates target reconnaissance and spear-phishing email creation. Our\nmethod achieves a 91\\% F1 score in identifying LLM-generated spear-phishing\nemails, with the training set comprising only traditional phishing and benign\nemails. Key contributions include a novel document vectorization method\nutilizing LLM reasoning, a publicly available dataset of high-quality\nspear-phishing emails, and the demonstrated effectiveness of our method in\ndetecting such emails. 
This methodology can be utilized for various document\nclassification tasks, particularly in adversarial problem domains.\n","authors":["Daniel Nahmias","Gal Engelberg","Dan Klein","Asaf Shabtai"],"pdf_url":"https://arxiv.org/pdf/2402.08309v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.15471v2","updated":"2024-12-24T13:33:51Z","published":"2024-12-20T00:56:13Z","title":"A Review of the Marathi Natural Language Processing","summary":" Marathi is one of the most widely used languages in the world. One might\nexpect that the latest advances in NLP research in languages like English reach\nsuch a large community. However, NLP advancements in English didn't immediately\nreach Indian languages like Marathi. There were several reasons for this. They\nincluded diversity of scripts used, lack of (publicly available) resources like\ntokenization strategies, high quality datasets \\& benchmarks, and evaluation\nmetrics. In addition to this, the morphologically rich nature of Marathi, made\nNLP tasks challenging. Advances in Neural Network (NN) based models and tools\nsince the early 2000s helped improve this situation and make NLP research more\naccessible. In the past 10 years, significant efforts were made to improve\nlanguage resources for all 22 scheduled languages of India. This paper presents\na broad overview of evolution of NLP research in Indic languages with a focus\non Marathi and state-of-the-art resources and tools available to the research\ncommunity. 
It also provides an overview of tools \\& techniques associated with\nMarathi NLP tasks.\n","authors":["Asang Dani","Shailesh R Sathe"],"pdf_url":"https://arxiv.org/pdf/2412.15471v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.07019v2","updated":"2024-12-24T13:26:51Z","published":"2024-11-11T14:22:42Z","title":"UniHR: Hierarchical Representation Learning for Unified Knowledge Graph\n Link Prediction","summary":" Beyond-triple fact representations including hyper-relational facts with\nauxiliary key-value pairs, temporal facts with additional timestamps, and\nnested facts implying relationships between facts, are gaining significant\nattention. However, existing link prediction models are usually designed for\none specific type of facts, making it difficult to generalize to other fact\nrepresentations. To overcome this limitation, we propose a Unified Hierarchical\nRepresentation learning framework (UniHR) for unified knowledge graph link\nprediction. It consists of a unified Hierarchical Data Representation (HiDR)\nmodule and a unified Hierarchical Structure Learning (HiSL) module as graph\nencoder. The HiDR module unifies hyper-relational KGs, temporal KGs, and nested\nfactual KGs into triple-based representations. Then HiSL incorporates\nintra-fact and inter-fact message passing, focusing on enhancing the semantic\ninformation within individual facts and enriching the structural information\nbetween facts. Experimental results across 7 datasets from 3 types of KGs\ndemonstrate that our UniHR outperforms baselines designed for one specific kind\nof KG, indicating strong generalization capability of HiDR form and the\neffectiveness of HiSL module. 
Code and data are available at\nhttps://github.com/Lza12a/UniHR.\n","authors":["Zhiqiang Liu","Mingyang Chen","Yin Hua","Zhuo Chen","Ziqi Liu","Lei Liang","Huajun Chen","Wen Zhang"],"pdf_url":"https://arxiv.org/pdf/2411.07019v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2410.04739v2","updated":"2024-12-24T13:18:49Z","published":"2024-10-07T04:15:02Z","title":"TableRAG: Million-Token Table Understanding with Language Models","summary":" Recent advancements in language models (LMs) have notably enhanced their\nability to reason with tabular data, primarily through program-aided mechanisms\nthat manipulate and analyze tables. However, these methods often require the\nentire table as input, leading to scalability challenges due to the positional\nbias or context length constraints. In response to these challenges, we\nintroduce TableRAG, a Retrieval-Augmented Generation (RAG) framework\nspecifically designed for LM-based table understanding. TableRAG leverages\nquery expansion combined with schema and cell retrieval to pinpoint crucial\ninformation before providing it to the LMs. This enables more efficient data\nencoding and precise retrieval, significantly reducing prompt lengths and\nmitigating information loss. We have developed two new million-token benchmarks\nfrom the Arcade and BIRD-SQL datasets to thoroughly evaluate TableRAG's\neffectiveness at scale. 
Our results demonstrate that TableRAG's retrieval\ndesign achieves the highest retrieval quality, leading to the new\nstate-of-the-art performance on large-scale table understanding.\n","authors":["Si-An Chen","Lesly Miculicich","Julian Martin Eisenschlos","Zifeng Wang","Zilong Wang","Yanfei Chen","Yasuhisa Fujii","Hsuan-Tien Lin","Chen-Yu Lee","Tomas Pfister"],"pdf_url":"https://arxiv.org/pdf/2410.04739v2.pdf","comment":"Accepted to NeurIPS 2024"},{"id":"http://arxiv.org/abs/2412.18415v1","updated":"2024-12-24T13:07:29Z","published":"2024-12-24T13:07:29Z","title":"Multilingual Mathematical Reasoning: Advancing Open-Source LLMs in Hindi\n and English","summary":" Large Language Models (LLMs) excel in linguistic tasks but struggle with\nmathematical reasoning, particularly in non English languages like Hindi. This\nresearch aims to enhance the mathematical reasoning skills of smaller, resource\nefficient open-source LLMs in both Hindi and English. We evaluate models like\nOpenHathi 7B, LLaMA-2 7B, WizardMath 7B, Mistral 7B, LLeMMa 7B, MAmmoTH 7B,\nGemini Pro, and GPT-4 using zero-shot, few-shot chain-of-thought (CoT) methods,\nand supervised fine-tuning. Our approach incorporates curriculum learning,\nprogressively training models on increasingly difficult problems, a novel\nDecomposition Strategy to simplify complex arithmetic operations, and a\nStructured Solution Design that divides solutions into phases. Our experiments\nresult in notable performance enhancements. WizardMath 7B exceeds Gemini's\naccuracy on English datasets by +6% and matches Gemini's performance on Hindi\ndatasets. Adopting a bilingual approach that combines English and Hindi samples\nachieves results comparable to individual language models, demonstrating the\ncapability to learn mathematical reasoning in both languages. 
This research\nhighlights the potential for improving mathematical reasoning in open-source\nLLMs.\n","authors":["Avinash Anand","Kritarth Prasad","Chhavi Kirtani","Ashwin R Nair","Manvendra Kumar Nema","Raj Jaiswal","Rajiv Ratn Shah"],"pdf_url":"https://arxiv.org/pdf/2412.18415v1.pdf","comment":"Accepted at AAAI 2025"},{"id":"http://arxiv.org/abs/2412.16187v2","updated":"2024-12-24T13:04:45Z","published":"2024-12-13T06:00:27Z","title":"HashEvict: A Pre-Attention KV Cache Eviction Strategy using\n Locality-Sensitive Hashing","summary":" Transformer-based large language models (LLMs) use the key-value (KV) cache\nto significantly accelerate inference by storing the key and value embeddings\nof past tokens. However, this cache consumes significant GPU memory. In this\nwork, we introduce HashEvict, an algorithm that uses locality-sensitive hashing\n(LSH) to compress the KV cache. HashEvict quickly locates tokens in the cache\nthat are cosine dissimilar to the current query token. This is achieved by\ncomputing the Hamming distance between binarized Gaussian projections of the\ncurrent token query and cached token keys, with a projection length much\nsmaller than the embedding dimension. We maintain a lightweight binary\nstructure in GPU memory to facilitate these calculations. Unlike existing\ncompression strategies that compute attention to determine token retention,\nHashEvict makes these decisions pre-attention, thereby reducing computational\ncosts. Additionally, HashEvict is dynamic - at every decoding step, the key and\nvalue of the current token replace the embeddings of a token expected to\nproduce the lowest attention score. 
We demonstrate that HashEvict can compress\nthe KV cache by 30%-70% while maintaining high performance across reasoning,\nmultiple-choice, long-context retrieval and summarization tasks.\n","authors":["Minghui Liu","Tahseen Rabbani","Tony O'Halloran","Ananth Sankaralingam","Mary-Anne Hartley","Brian Gravelle","Furong Huang","Cornelia Fermüller","Yiannis Aloimonos"],"pdf_url":"https://arxiv.org/pdf/2412.16187v2.pdf","comment":"10 pages, 6 figures, 2 tables"},{"id":"http://arxiv.org/abs/2412.18377v1","updated":"2024-12-24T12:03:36Z","published":"2024-12-24T12:03:36Z","title":"ChaI-TeA: A Benchmark for Evaluating Autocompletion of Interactions with\n LLM-based Chatbots","summary":" The rise of LLMs has deflected a growing portion of human-computer\ninteractions towards LLM-based chatbots. The remarkable abilities of these\nmodels allow users to interact using long, diverse natural language text\ncovering a wide range of topics and styles. Phrasing these messages is a time\nand effort consuming task, calling for an autocomplete solution to assist\nusers. We introduce the task of chatbot interaction autocomplete. We present\nChaI-TeA: CHat InTEraction Autocomplete; An autcomplete evaluation framework\nfor LLM-based chatbot interactions. The framework includes a formal definition\nof the task, coupled with suitable datasets and metrics. We use the framework\nto evaluate After formally defining the task along with suitable datasets and\nmetrics, we test 9 models on the defined auto completion task, finding that\nwhile current off-the-shelf models perform fairly, there is still much room for\nimprovement, mainly in ranking of the generated suggestions. We provide\ninsights for practitioners working on this task and open new research\ndirections for researchers in the field. 
We release our framework to serve as a\nfoundation for future research.\n","authors":["Shani Goren","Oren Kalinsky","Tomer Stav","Yuri Rapoport","Yaron Fairstein","Ram Yazdy","Nachshon Cohen","Alexander Libov","Guy Kushilevitz"],"pdf_url":"https://arxiv.org/pdf/2412.18377v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.18376v1","updated":"2024-12-24T12:02:43Z","published":"2024-12-24T12:02:43Z","title":"Bidirectional Topic Matching: Quantifying Thematic Overlap Between\n Corpora Through Topic Modelling","summary":" This study introduces Bidirectional Topic Matching (BTM), a novel method for\ncross-corpus topic modeling that quantifies thematic overlap and divergence\nbetween corpora. BTM is a flexible framework that can incorporate various topic\nmodeling approaches, including BERTopic, Top2Vec, and Latent Dirichlet\nAllocation (LDA). BTM employs a dual-model approach, training separate topic\nmodels for each corpus and applying them reciprocally to enable comprehensive\ncross-corpus comparisons. This methodology facilitates the identification of\nshared themes and unique topics, providing nuanced insights into thematic\nrelationships. Validation against cosine similarity-based methods demonstrates\nthe robustness of BTM, with strong agreement metrics and distinct advantages in\nhandling outlier topics. A case study on climate news articles showcases BTM's\nutility, revealing significant thematic overlaps and distinctions between\ncorpora focused on climate change and climate action. BTM's flexibility and\nprecision make it a valuable tool for diverse applications, from political\ndiscourse analysis to interdisciplinary studies. By integrating shared and\nunique topic analyses, BTM offers a comprehensive framework for exploring\nthematic relationships, with potential extensions to multilingual and dynamic\ndatasets. 
This work highlights BTM's methodological contributions and its\ncapacity to advance discourse analysis across various domains.\n","authors":["Raven Adam","Marie Lisa Kogler"],"pdf_url":"https://arxiv.org/pdf/2412.18376v1.pdf","comment":"12 pages, 4 figures"},{"id":"http://arxiv.org/abs/2412.18367v1","updated":"2024-12-24T11:50:18Z","published":"2024-12-24T11:50:18Z","title":"Towards Global AI Inclusivity: A Large-Scale Multilingual Terminology\n Dataset","summary":" The field of machine translation has achieved significant advancements, yet\ndomain-specific terminology translation, particularly in AI, remains\nchallenging. We introduced GIST, a large-scale multilingual AI terminology\ndataset containing 5K terms extracted from top AI conference papers spanning\n2000 to 2023. The terms were translated into Arabic, Chinese, French, Japanese,\nand Russian using a hybrid framework that combines LLMs for extraction with\nhuman expertise for translation. The dataset's quality was benchmarked against\nexisting resources, demonstrating superior translation accuracy through\ncrowdsourced evaluation. GIST was integrated into translation workflows using\npost-translation refinement methods that required no retraining, where LLM\nprompting consistently improved BLEU and COMET scores. A web demonstration on\nthe ACL Anthology platform highlights its practical application, showcasing\nimproved accessibility for non-English speakers. 
This work aims to address\ncritical gaps in AI terminology resources and fosters global inclusivity and\ncollaboration in AI research.\n","authors":["Jiarui Liu","Iman Ouzzani","Wenkai Li","Lechen Zhang","Tianyue Ou","Houda Bouamor","Zhijing Jin","Mona Diab"],"pdf_url":"https://arxiv.org/pdf/2412.18367v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.18364v1","updated":"2024-12-24T11:48:16Z","published":"2024-12-24T11:48:16Z","title":"Extracting triples from dialogues for conversational social agents","summary":" Obtaining an explicit understanding of communication within a Hybrid\nIntelligence collaboration is essential to create controllable and transparent\nagents. In this paper, we describe a number of Natural Language Understanding\nmodels that extract explicit symbolic triples from social conversation. Triple\nextraction has mostly been developed and tested for Knowledge Base Completion\nusing Wikipedia text and data for training and testing. However, social\nconversation is very different as a genre in which interlocutors exchange\ninformation in sequences of utterances that involve statements, questions, and\nanswers. Phenomena such as co-reference, ellipsis, coordination, and implicit\nand explicit negation or confirmation are more prominent in conversation than\nin Wikipedia text. We therefore describe an attempt to fill this gap by\nreleasing data sets for training and testing triple extraction from social\nconversation. We also created five triple extraction models and tested them in\nour evaluation data. The highest precision is 51.14 for complete triples and\n69.32 for triple elements when tested on single utterances. 
However, scores for\nconversational triples that span multiple turns are much lower, showing that\nextracting knowledge from true conversational data is much more challenging.\n","authors":["Piek Vossen","Selene Báez Santamaría","Lenka Bajčetić","Thomas Belluci"],"pdf_url":"https://arxiv.org/pdf/2412.18364v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.16964v2","updated":"2024-12-24T11:43:25Z","published":"2024-12-22T10:49:27Z","title":"System-2 Mathematical Reasoning via Enriched Instruction Tuning","summary":" Solving complex mathematical problems via system-2 reasoning is a natural\nhuman skill, yet it remains a significant challenge for current large language\nmodels (LLMs). We identify the scarcity of deliberate multi-step reasoning data\nas a primary limiting factor. To this end, we introduce Enriched Instruction\nTuning (EIT), a method that enriches existing human-annotated mathematical\ndatasets by synergizing human and AI feedback to create fine-grained reasoning\ntrajectories. These datasets are then used to fine-tune open-source LLMs,\nenhancing their mathematical reasoning abilities without reliance on any\nsymbolic verification program. Concretely, EIT is composed of two critical\nsteps: Enriching with Reasoning Plan (ERP) and Enriching with Reasoning Step\n(ERS). The former generates a high-level plan that breaks down complex\ninstructions into a sequence of simpler objectives, while ERS fills in\nreasoning contexts often overlooked by human annotators, creating a smoother\nreasoning trajectory for LLM fine-tuning. Unlike existing CoT prompting methods\nthat generate reasoning chains only depending on LLM's internal knowledge, our\nmethod leverages human-annotated initial answers as ``meta-knowledge'' to help\nLLMs generate more detailed and precise reasoning processes, leading to a more\ntrustworthy LLM expert for complex mathematical problems. 
In experiments, EIT\nachieves an accuracy of 84.1% on GSM8K and 32.5% on MATH, surpassing\nstate-of-the-art fine-tuning and prompting methods, and even matching the\nperformance of tool-augmented methods.\n","authors":["Huanqia Cai","Yijun Yang","Zhifeng Li"],"pdf_url":"https://arxiv.org/pdf/2412.16964v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.18351v1","updated":"2024-12-24T11:24:56Z","published":"2024-12-24T11:24:56Z","title":"Multi-Agents Based on Large Language Models for Knowledge-based Visual\n Question Answering","summary":" Large Language Models (LLMs) have achieved impressive results in\nknowledge-based Visual Question Answering (VQA). However existing methods still\nhave challenges: the inability to use external tools autonomously, and the\ninability to work in teams. Humans tend to know whether they need to use\nexternal tools when they encounter a new question, e.g., they tend to be able\nto give a direct answer to a familiar question, whereas they tend to use tools\nsuch as search engines when they encounter an unfamiliar question. In addition,\nhumans also tend to collaborate and discuss with others to get better answers.\nInspired by this, we propose the multi-agent voting framework. We design three\nLLM-based agents that simulate different levels of staff in a team, and assign\nthe available tools according to the levels. Each agent provides the\ncorresponding answer, and finally all the answers provided by the agents are\nvoted to get the final answer. 
Experiments on OK-VQA and A-OKVQA show that our\napproach outperforms other baselines by 2.2 and 1.0, respectively.\n","authors":["Zhongjian Hu","Peng Yang","Bing Li","Zhenqi Wang"],"pdf_url":"https://arxiv.org/pdf/2412.18351v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.17690v2","updated":"2024-12-24T11:03:42Z","published":"2024-12-23T16:16:30Z","title":"RAGONITE: Iterative Retrieval on Induced Databases and Verbalized RDF\n for Conversational QA over KGs with RAG","summary":" Conversational question answering (ConvQA) is a convenient means of searching\nover RDF knowledge graphs (KGs), where a prevalent approach is to translate\nnatural language questions to SPARQL queries. However, SPARQL has certain\nshortcomings: (i) it is brittle for complex intents and conversational\nquestions, and (ii) it is not suitable for more abstract needs. Instead, we\npropose a novel two-pronged system where we fuse: (i) SQL-query results over a\ndatabase automatically derived from the KG, and (ii) text-search results over\nverbalizations of KG facts. Our pipeline supports iterative retrieval: when the\nresults of any branch are found to be unsatisfactory, the system can\nautomatically opt for further rounds. We put everything together in a retrieval\naugmented generation (RAG) setup, where an LLM generates a coherent response\nfrom accumulated search results. 
We demonstrate the superiority of our proposed\nsystem over several baselines on a knowledge graph of BMW automobiles.\n","authors":["Rishiraj Saha Roy","Chris Hinze","Joel Schlotthauer","Farzad Naderi","Viktor Hangya","Andreas Foltyn","Luzian Hahn","Fabian Kuech"],"pdf_url":"https://arxiv.org/pdf/2412.17690v2.pdf","comment":"Accepted at BTW 2025, 10 pages"},{"id":"http://arxiv.org/abs/2411.15364v2","updated":"2024-12-24T10:57:49Z","published":"2024-11-22T22:13:40Z","title":"Exploring Facets of Language Generation in the Limit","summary":" The recent work of Kleinberg & Mullainathan [KM24] provides a concrete model\nfor language generation in the limit: given a sequence of examples from an\nunknown target language, the goal is to generate new examples from the target\nlanguage such that no incorrect examples are generated beyond some point. In\nsharp contrast to strong negative results for the closely related problem of\nlanguage identification, they establish positive results for language\ngeneration in the limit for all countable collections of languages. Follow-up\nwork by Raman & Tewari [RT24] studies bounds on the number of distinct inputs\nrequired by an algorithm before correct language generation is achieved --\nnamely, whether this is a constant for all languages in the collection (uniform\ngeneration) or a language-dependent constant (non-uniform generation).\n We show that every countable language collection has a generator which has\nthe stronger property of non-uniform generation in the limit. However, while\nthe generation algorithm of [KM24] can be implemented using membership queries,\nwe show that any algorithm cannot non-uniformly generate even for collections\nof just two languages, using only membership queries.\n We also formalize the tension between validity and breadth in the generation\nalgorithm of [KM24] by introducing a definition of exhaustive generation, and\nshow a strong negative result for exhaustive generation. 
Our result shows that\na tradeoff between validity and breadth is inherent for generation in the\nlimit. We also provide a precise characterization of the language collections\nfor which exhaustive generation is possible. Finally, inspired by algorithms\nthat can choose to obtain feedback, we consider a model of uniform generation\nwith feedback, completely characterizing language collections for which such\nuniform generation with feedback is possible in terms of a complexity measure\nof the collection.\n","authors":["Moses Charikar","Chirag Pabbaraju"],"pdf_url":"https://arxiv.org/pdf/2411.15364v2.pdf","comment":"31 pages. Fixed typos, updated related work, added results on\n characterization of exhaustive generation"},{"id":"http://arxiv.org/abs/2412.15529v2","updated":"2024-12-24T10:32:13Z","published":"2024-12-20T03:37:07Z","title":"XRAG: eXamining the Core -- Benchmarking Foundational Components in\n Advanced Retrieval-Augmented Generation","summary":" Retrieval-augmented generation (RAG) synergizes the retrieval of pertinent\ndata with the generative capabilities of Large Language Models (LLMs), ensuring\nthat the generated output is not only contextually relevant but also accurate\nand current. We introduce XRAG, an open-source, modular codebase that\nfacilitates exhaustive evaluation of the performance of foundational components\nof advanced RAG modules. These components are systematically categorized into\nfour core phases: pre-retrieval, retrieval, post-retrieval, and generation. We\nsystematically analyse them across reconfigured datasets, providing a\ncomprehensive benchmark for their effectiveness. As the complexity of RAG\nsystems continues to escalate, we underscore the critical need to identify\npotential failure points in RAG systems. We formulate a suite of experimental\nmethodologies and diagnostic testing protocols to dissect the failure points\ninherent in RAG engineering. 
Subsequently, we proffer bespoke solutions aimed\nat bolstering the overall performance of these modules. Our work thoroughly\nevaluates the performance of advanced core components in RAG systems, providing\ninsights into optimizations for prevalent failure points.\n","authors":["Qianren Mao","Yangyifei Luo","Jinlong Zhang","Hanwen Hao","Zhilong Cao","Xiaolong Wang","Xiao Guan","Zhenting Huang","Weifeng Jiang","Shuyu Guo","Zhentao Han","Qili Zhang","Siyuan Tao","Yujie Liu","Junnan Liu","Zhixing Tan","Jie Sun","Bo Li","Xudong Liu","Richong Zhang","Jianxin Li"],"pdf_url":"https://arxiv.org/pdf/2412.15529v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2410.06094v2","updated":"2024-12-24T09:46:41Z","published":"2024-10-08T14:49:41Z","title":"Listening to Patients: A Framework of Detecting and Mitigating Patient\n Misreport for Medical Dialogue Generation","summary":" Medical Dialogue Systems aim to provide automated healthcare support through\npatient-agent conversations. Previous efforts typically regard patients as\nideal users -- one who accurately and consistently reports their health\nconditions. However, in reality, patients often misreport their symptoms,\nleading to discrepancies between their reports and actual health conditions.\nOverlooking patient misreport will affect the quality of healthcare\nconsultations provided by MDS. To address this issue, we argue that MDS should\n''listen to patients'' and tackle two key challenges: how to detect and\nmitigate patient misreport effectively. In this work, we propose PaMis, a\nframework of detecting and mitigating Patient Misreport for medical dialogue\ngeneration. PaMis first constructs dialogue entity graphs, then detects patient\nmisreport based on graph entropy, and mitigates patient misreport by\nformulating clarifying questions. 
Experiments indicate that PaMis effectively\nenhances medical response generation, enabling models like GPT-4 to detect and\nmitigate patient misreports, and provide high-quality healthcare assistance.\n","authors":["Lang Qin","Yao Zhang","Hongru Liang","Adam Jatowt","Zhenglu Yang"],"pdf_url":"https://arxiv.org/pdf/2410.06094v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.18299v1","updated":"2024-12-24T09:06:58Z","published":"2024-12-24T09:06:58Z","title":"M-Ped: Multi-Prompt Ensemble Decoding for Large Language Models","summary":" With the widespread application of Large Language Models (LLMs) in the field\nof Natural Language Processing (NLP), enhancing their performance has become a\nresearch hotspot. This paper presents a novel multi-prompt ensemble decoding\napproach designed to bolster the generation quality of LLMs by leveraging the\naggregation of outcomes from multiple prompts. Given a unique input $X$, we\nsubmit $n$ variations of prompts with $X$ to LLMs in batch mode to decode and\nderive probability distributions. For each token prediction, we calculate the\nensemble probability by averaging the $n$ probability distributions within the\nbatch, utilizing this aggregated probability to generate the token. This\ntechnique is dubbed Inner-Batch Ensemble. To facilitate efficient batch\ninference, we implement a Left-Padding strategy to maintain uniform input\nlengths across the n prompts. Through extensive experimentation on diverse NLP\ntasks, including machine translation, code generation, and text simplification,\nwe demonstrate the efficacy of our method in enhancing LLM performance. 
The\nresults show substantial improvements in BLEU scores, pass@$k$ rates, and LENS\nmetrics over conventional methods.\n","authors":["Jiaxin Guo","Daimeng Wei","Yuanchang Luo","Shimin Tao","Hengchao Shang","Zongyao Li","Shaojun Li","Jinlong Yang","Zhanglin Wu","Zhiqiang Rao","Hao Yang"],"pdf_url":"https://arxiv.org/pdf/2412.18299v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.11465v3","updated":"2024-12-24T09:03:02Z","published":"2024-11-18T10:58:46Z","title":"Re-examining learning linear functions in context","summary":" In-context learning (ICL) has emerged as a powerful paradigm for easily\nadapting Large Language Models (LLMs) to various tasks. However, our\nunderstanding of how ICL works remains limited. We explore a simple model of\nICL in a controlled setup with synthetic training data to investigate ICL of\nunivariate linear functions. We experiment with a range of GPT-2-like\ntransformer models trained from scratch. Our findings challenge the prevailing\nnarrative that transformers adopt algorithmic approaches like linear regression\nto learn a linear function in-context. These models fail to generalize beyond\ntheir training distribution, highlighting fundamental limitations in their\ncapacity to infer abstract task structures. Our experiments lead us to propose\na mathematically precise hypothesis of what the model might be learning.\n","authors":["Omar Naim","Guilhem Fouilhé","Nicholas Asher"],"pdf_url":"https://arxiv.org/pdf/2411.11465v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.12488v2","updated":"2024-12-24T08:58:16Z","published":"2023-08-24T01:17:16Z","title":"GPTEval: A Survey on Assessments of ChatGPT and GPT-4","summary":" The emergence of ChatGPT has generated much speculation in the press about\nits potential to disrupt social and economic systems. Its astonishing language\nability has aroused strong curiosity among scholars about its performance in\ndifferent domains. 
There have been many studies evaluating the ability of\nChatGPT and GPT-4 in different tasks and disciplines. However, a comprehensive\nreview summarizing the collective assessment findings is lacking. The objective\nof this survey is to thoroughly analyze prior assessments of ChatGPT and GPT-4,\nfocusing on its language and reasoning abilities, scientific knowledge, and\nethical considerations. Furthermore, an examination of the existing evaluation\nmethods is conducted, offering several recommendations for future research in\nevaluating large language models.\n","authors":["Rui Mao","Guanyi Chen","Xulang Zhang","Frank Guerin","Erik Cambria"],"pdf_url":"https://arxiv.org/pdf/2308.12488v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.18291v1","updated":"2024-12-24T08:53:54Z","published":"2024-12-24T08:53:54Z","title":"DeepCRCEval: Revisiting the Evaluation of Code Review Comment Generation","summary":" Code review is a vital but demanding aspect of software development,\ngenerating significant interest in automating review comments. Traditional\nevaluation methods for these comments, primarily based on text similarity, face\ntwo major challenges: inconsistent reliability of human-authored comments in\nopen-source projects and the weak correlation of text similarity with\nobjectives like enhancing code quality and detecting defects.\n This study empirically analyzes benchmark comments using a novel set of\ncriteria informed by prior research and developer interviews. We then similarly\nrevisit the evaluation of existing methodologies. 
Our evaluation framework,\nDeepCRCEval, integrates human evaluators and Large Language Models (LLMs) for a\ncomprehensive reassessment of current techniques based on the criteria set.\nBesides, we also introduce an innovative and efficient baseline, LLM-Reviewer,\nleveraging the few-shot learning capabilities of LLMs for a target-oriented\ncomparison.\n Our research highlights the limitations of text similarity metrics, finding\nthat less than 10% of benchmark comments are high quality for automation. In\ncontrast, DeepCRCEval effectively distinguishes between high and low-quality\ncomments, proving to be a more reliable evaluation mechanism. Incorporating LLM\nevaluators into DeepCRCEval significantly boosts efficiency, reducing time and\ncost by 88.78% and 90.32%, respectively. Furthermore, LLM-Reviewer demonstrates\nsignificant potential of focusing task real targets in comment generation.\n","authors":["Junyi Lu","Xiaojia Li","Zihan Hua","Lei Yu","Shiqi Cheng","Li Yang","Fengjun Zhang","Chun Zuo"],"pdf_url":"https://arxiv.org/pdf/2412.18291v1.pdf","comment":"Accepted to the 28th International Conference on Fundamental\n Approaches to Software Engineering (FASE 2025), part of the 28th European\n Joint Conferences on Theory and Practice of Software (ETAPS 2025)"},{"id":"http://arxiv.org/abs/2412.18274v1","updated":"2024-12-24T08:33:44Z","published":"2024-12-24T08:33:44Z","title":"GenAI Content Detection Task 2: AI vs. Human -- Academic Essay\n Authenticity Challenge","summary":" This paper presents a comprehensive overview of the first edition of the\nAcademic Essay Authenticity Challenge, organized as part of the GenAI Content\nDetection shared tasks collocated with COLING 2025. This challenge focuses on\ndetecting machine-generated vs. human-authored essays for academic purposes.\nThe task is defined as follows: \"Given an essay, identify whether it is\ngenerated by a machine or authored by a human.'' The challenge involves two\nlanguages: English and Arabic. 
During the evaluation phase, 25 teams submitted\nsystems for English and 21 teams for Arabic, reflecting substantial interest in\nthe task. Finally, seven teams submitted system description papers. The\nmajority of submissions utilized fine-tuned transformer-based models, with one\nteam employing Large Language Models (LLMs) such as Llama 2 and Llama 3. This\npaper outlines the task formulation, details the dataset construction process,\nand explains the evaluation framework. Additionally, we present a summary of\nthe approaches adopted by participating teams. Nearly all submitted systems\noutperformed the n-gram-based baseline, with the top-performing systems\nachieving F1 scores exceeding 0.98 for both languages, indicating significant\nprogress in the detection of machine-generated text.\n","authors":["Shammur Absar Chowdhury","Hind Almerekhi","Mucahid Kutlu","Kaan Efe Keles","Fatema Ahmad","Tasnim Mohiuddin","George Mikros","Firoj Alam"],"pdf_url":"https://arxiv.org/pdf/2412.18274v1.pdf","comment":"AI Generated Content, Academic Essay, LLMs, Arabic, English"},{"id":"http://arxiv.org/abs/2412.18260v1","updated":"2024-12-24T08:20:29Z","published":"2024-12-24T08:20:29Z","title":"Investigating Large Language Models for Code Vulnerability Detection: An\n Experimental Study","summary":" Code vulnerability detection (CVD) is essential for addressing and preventing\nsystem security issues, playing a crucial role in ensuring software security.\nPrevious learning-based vulnerability detection methods rely on either\nfine-tuning medium-size sequence models or training smaller neural networks\nfrom scratch. Recent advancements in large pre-trained language models (LLMs)\nhave showcased remarkable capabilities in various code intelligence tasks\nincluding code understanding and generation. However, the effectiveness of LLMs\nin detecting code vulnerabilities is largely under-explored. 
This work aims to\ninvestigate the gap by fine-tuning LLMs for the CVD task, involving four\nwidely-used open-source LLMs. We also implement other five previous graph-based\nor medium-size sequence models for comparison. Experiments are conducted on\nfive commonly-used CVD datasets, including both the part of short samples and\nlong samples. In addition, we conduct quantitative experiments to investigate\nthe class imbalance issue and the model's performance on samples of different\nlengths, which are rarely studied in previous works. To better facilitate\ncommunities, we open-source all codes and resources of this study in\nhttps://github.com/SakiRinn/LLM4CVD and\nhttps://huggingface.co/datasets/xuefen/VulResource.\n","authors":["Xuefeng Jiang","Lvhua Wu","Sheng Sun","Jia Li","Jingjing Xue","Yuwei Wang","Tingting Wu","Min Liu"],"pdf_url":"https://arxiv.org/pdf/2412.18260v1.pdf","comment":"Under Review"},{"id":"http://arxiv.org/abs/2412.16576v2","updated":"2024-12-24T07:56:48Z","published":"2024-12-21T10:40:56Z","title":"Open-Vocabulary Mobile Manipulation Based on Double Relaxed Contrastive\n Learning with Dense Labeling","summary":" Growing labor shortages are increasing the demand for domestic service robots\n(DSRs) to assist in various settings. In this study, we develop a DSR that\ntransports everyday objects to specified pieces of furniture based on\nopen-vocabulary instructions. Our approach focuses on retrieving images of\ntarget objects and receptacles from pre-collected images of indoor\nenvironments. For example, given an instruction \"Please get the right red towel\nhanging on the metal towel rack and put it in the white washing machine on the\nleft,\" the DSR is expected to carry the red towel to the washing machine based\non the retrieved images. This is challenging because the correct images should\nbe retrieved from thousands of collected images, which may include many images\nof similar towels and appliances. 
To address this, we propose RelaX-Former,\nwhich learns diverse and robust representations from among positive, unlabeled\npositive, and negative samples. We evaluated RelaX-Former on a dataset\ncontaining real-world indoor images and human annotated instructions including\ncomplex referring expressions. The experimental results demonstrate that\nRelaX-Former outperformed existing baseline models across standard image\nretrieval metrics. Moreover, we performed physical experiments using a DSR to\nevaluate the performance of our approach in a zero-shot transfer setting. The\nexperiments involved the DSR to carry objects to specific receptacles based on\nopen-vocabulary instructions, achieving an overall success rate of 75%.\n","authors":["Daichi Yashima","Ryosuke Korekata","Komei Sugiura"],"pdf_url":"https://arxiv.org/pdf/2412.16576v2.pdf","comment":"Accepted for IEEE RA-L 2025"},{"id":"http://arxiv.org/abs/2411.02688v2","updated":"2024-12-24T07:47:02Z","published":"2024-11-05T00:16:01Z","title":"On the loss of context-awareness in general instruction fine-tuning","summary":" Pre-trained Large Language Models (LLMs) require post-training methods such\nas supervised fine-tuning (SFT) on instruction-response pairs to enable\ninstruction following. However, this process can potentially harm existing\ncapabilities learned during pre-training. In this paper, we investigate the\nloss of context awareness after SFT, where context awareness is defined as the\nability to extract and understand information from user-provided context and\nrespond accordingly. We are the first to identify and show that the loss of\ncontext awareness, as reflected by the performance drop in the\nNeedle-in-a-Haystack test, occurs in instruction fine-tuned LLMs when the chat\ntemplate is applied to input prompts. We identify that the performance decline\nis partially caused by an attention bias toward different roles learned during\nconversational instruction fine-tuning. 
We validate our hypothesis by\nvisualizing changes in attention allocation after the chat template is applied\nand manually steering the attention heads. Based on these observations, we\npropose a metric to select context-dependent examples from general instruction\nfine-tuning datasets. We then apply conditional instruction fine-tuning with a\ncontext-dependency indicator, enabling the model to learn context awareness\nfrom these selected examples. Empirical experiments on four context-dependent\ndownstream tasks and three pre-trained LLMs of different sizes show that our\nmethod effectively mitigates the loss of context awareness without compromising\ngeneral instruction-following capabilities. Given our findings, we strongly\nadvocate for careful benchmarking of context awareness after instruction\nfine-tuning.\n","authors":["Yihan Wang","Andrew Bai","Nanyun Peng","Cho-Jui Hsieh"],"pdf_url":"https://arxiv.org/pdf/2411.02688v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.18216v1","updated":"2024-12-24T06:45:36Z","published":"2024-12-24T06:45:36Z","title":"ICM-Assistant: Instruction-tuning Multimodal Large Language Models for\n Rule-based Explainable Image Content Moderation","summary":" Controversial contents largely inundate the Internet, infringing various\ncultural norms and child protection standards. Traditional Image Content\nModeration (ICM) models fall short in producing precise moderation decisions\nfor diverse standards, while recent multimodal large language models (MLLMs),\nwhen adopted to general rule-based ICM, often produce classification and\nexplanation results that are inconsistent with human moderators. Aiming at\nflexible, explainable, and accurate ICM, we design a novel rule-based dataset\ngeneration pipeline, decomposing concise human-defined rules and leveraging\nwell-designed multi-stage prompts to enrich short explicit image annotations.\nOur ICM-Instruct dataset includes detailed moderation explanation and\nmoderation Q-A pairs. 
Built upon it, we create our ICM-Assistant model in the\nframework of rule-based ICM, making it readily applicable in real practice. Our\nICM-Assistant model demonstrates exceptional performance and flexibility.\nSpecifically, it significantly outperforms existing approaches on various\nsources, improving both the moderation classification (36.8\\% on average) and\nmoderation explanation quality (26.6\\% on average) consistently over existing\nMLLMs. Code/Data is available at https://github.com/zhaoyuzhi/ICM-Assistant.\n","authors":["Mengyang Wu","Yuzhi Zhao","Jialun Cao","Mingjie Xu","Zhongming Jiang","Xuehui Wang","Qinbin Li","Guangneng Hu","Shengchao Qin","Chi-Wing Fu"],"pdf_url":"https://arxiv.org/pdf/2412.18216v1.pdf","comment":"AAAI 2025"},{"id":"http://arxiv.org/abs/2409.01787v2","updated":"2024-12-24T06:06:25Z","published":"2024-09-03T11:06:45Z","title":"LLM-GAN: Construct Generative Adversarial Network Through Large Language\n Models For Explainable Fake News Detection","summary":" Explainable fake news detection predicts the authenticity of news items with\nannotated explanations. Today, Large Language Models (LLMs) are known for their\npowerful natural language understanding and explanation generation abilities.\nHowever, presenting LLMs for explainable fake news detection remains two main\nchallenges. Firstly, fake news appears reasonable and could easily mislead\nLLMs, leaving them unable to understand the complex news-faking process.\nSecondly, utilizing LLMs for this task would generate both correct and\nincorrect explanations, which necessitates abundant labor in the loop. In this\npaper, we propose LLM-GAN, a novel framework that utilizes prompting mechanisms\nto enable an LLM to become Generator and Detector and for realistic fake news\ngeneration and detection. Our results demonstrate LLM-GAN's effectiveness in\nboth prediction performance and explanation quality. 
We further showcase the\nintegration of LLM-GAN to a cloud-native AI platform to provide better fake\nnews detection service in the cloud.\n","authors":["Yifeng Wang","Zhouhong Gu","Siwei Zhang","Suhang Zheng","Tao Wang","Tianyu Li","Hongwei Feng","Yanghua Xiao"],"pdf_url":"https://arxiv.org/pdf/2409.01787v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.18196v1","updated":"2024-12-24T06:05:08Z","published":"2024-12-24T06:05:08Z","title":"Robustness-aware Automatic Prompt Optimization","summary":" The performance of Large Language Models (LLMs) is based on the quality of\nthe prompts and the semantic and structural integrity information of the input\ndata. However, current prompt generation methods primarily focus on generating\nprompts for clean input data, often overlooking the impact of perturbed inputs\non prompt performance. To address this limitation, we propose BATprompt (By\nAdversarial Training prompt), a novel method for prompt generation designed to\nwithstand input perturbations (such as typos in the input). Inspired by\nadversarial training techniques, BATprompt demonstrates strong performance on a\nvariety of perturbed tasks through a two-step process: adversarial perturbation\nand iterative optimization on unperturbed input via LLM. Unlike conventional\nadversarial attack methods, BATprompt avoids reliance on real gradients or\nmodel parameters. Instead, it leverages the advanced reasoning, language\nunderstanding and self reflection capabilities of LLMs to simulate gradients,\nguiding the generation of adversarial perturbations and optimizing prompt\nperformance. In our experiments, we evaluate BATprompt on multiple datasets\nacross both language understanding and generation tasks. 
The results indicate\nthat BATprompt outperforms existing prompt generation methods, delivering\nsuperior robustness and performance under diverse perturbation scenarios.\n","authors":["Zeru Shi","Zhenting Wang","Yongye Su","Weidi Luo","Fan Yang","Yongfeng Zhang"],"pdf_url":"https://arxiv.org/pdf/2412.18196v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.18194v1","updated":"2024-12-24T06:03:42Z","published":"2024-12-24T06:03:42Z","title":"VLABench: A Large-Scale Benchmark for Language-Conditioned Robotics\n Manipulation with Long-Horizon Reasoning Tasks","summary":" General-purposed embodied agents are designed to understand the users'\nnatural instructions or intentions and act precisely to complete universal\ntasks. Recently, methods based on foundation models especially\nVision-Language-Action models (VLAs) have shown a substantial potential to\nsolve language-conditioned manipulation (LCM) tasks well. However, existing\nbenchmarks do not adequately meet the needs of VLAs and relative algorithms. To\nbetter define such general-purpose tasks in the context of LLMs and advance the\nresearch in VLAs, we present VLABench, an open-source benchmark for evaluating\nuniversal LCM task learning. VLABench provides 100 carefully designed\ncategories of tasks, with strong randomization in each category of task and a\ntotal of 2000+ objects. VLABench stands out from previous benchmarks in four\nkey aspects: 1) tasks requiring world knowledge and common sense transfer, 2)\nnatural language instructions with implicit human intentions rather than\ntemplates, 3) long-horizon tasks demanding multi-step reasoning, and 4)\nevaluation of both action policies and language model capabilities. The\nbenchmark assesses multiple competencies including understanding of\nmesh\\&texture, spatial relationship, semantic instruction, physical laws,\nknowledge transfer and reasoning, etc. 
To support the downstream finetuning, we\nprovide high-quality training data collected via an automated framework\nincorporating heuristic skills and prior information. The experimental results\nindicate that both the current state-of-the-art pretrained VLAs and the\nworkflow based on VLMs face challenges in our tasks.\n","authors":["Shiduo Zhang","Zhe Xu","Peiju Liu","Xiaopeng Yu","Yuan Li","Qinghui Gao","Zhaoye Fei","Zhangyue Yin","Zuxuan Wu","Yu-Gang Jiang","Xipeng Qiu"],"pdf_url":"https://arxiv.org/pdf/2412.18194v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.18190v1","updated":"2024-12-24T05:54:40Z","published":"2024-12-24T05:54:40Z","title":"An Analysis on Automated Metrics for Evaluating Japanese-English Chat\n Translation","summary":" This paper analyses how traditional baseline metrics, such as BLEU and TER,\nand neural-based methods, such as BERTScore and COMET, score several NMT models\nperformance on chat translation and how these metrics perform when compared to\nhuman-annotated scores. The results show that for ranking NMT models in chat\ntranslations, all metrics seem consistent in deciding which model outperforms\nthe others. This implies that traditional baseline metrics, which are faster\nand simpler to use, can still be helpful. On the other hand, when it comes to\nbetter correlation with human judgment, neural-based metrics outperform\ntraditional metrics, with COMET achieving the highest correlation with the\nhuman-annotated score on a chat translation. However, we show that even the\nbest metric struggles when scoring English translations from sentences with\nanaphoric zero-pronoun in Japanese.\n","authors":["Andre Rusli","Makoto Shishido"],"pdf_url":"https://arxiv.org/pdf/2412.18190v1.pdf","comment":"Accepted at the 29th Annual Meeting of the Association for Natural\n Language Processing (NLP2023). 
Published version available at\n https://www.anlp.jp/proceedings/annual_meeting/2023/pdf_dir/A8-1.pdf"},{"id":"http://arxiv.org/abs/2412.18188v1","updated":"2024-12-24T05:50:18Z","published":"2024-12-24T05:50:18Z","title":"On the Applicability of Zero-Shot Cross-Lingual Transfer Learning for\n Sentiment Classification in Distant Language Pairs","summary":" This research explores the applicability of cross-lingual transfer learning\nfrom English to Japanese and Indonesian using the XLM-R pre-trained model. The\nresults are compared with several previous works, either by models using a\nsimilar zero-shot approach or a fully-supervised approach, to provide an\noverview of the zero-shot transfer learning approach's capability using XLM-R\nin comparison with existing models. Our models achieve the best result in one\nJapanese dataset and comparable results in other datasets in Japanese and\nIndonesian languages without being trained using the target language.\nFurthermore, the results suggest that it is possible to train a multi-lingual\nmodel, instead of one model for each language, and achieve promising results.\n","authors":["Andre Rusli","Makoto Shishido"],"pdf_url":"https://arxiv.org/pdf/2412.18188v1.pdf","comment":"Accepted at the 28th Annual Meeting of the Association for Natural\n Language Processing (NLP2022). Published version available at\n https://www.anlp.jp/proceedings/annual_meeting/2022/pdf_dir/A6-1.pdf"},{"id":"http://arxiv.org/abs/2412.09014v4","updated":"2024-12-24T05:41:32Z","published":"2024-12-12T07:24:16Z","title":"Improvement in Sign Language Translation Using Text CTC Alignment","summary":" Current sign language translation (SLT) approaches often rely on gloss-based\nsupervision with Connectionist Temporal Classification (CTC), limiting their\nability to handle non-monotonic alignments between sign language video and\nspoken text. In this work, we propose a novel method combining joint\nCTC/Attention and transfer learning. 
The joint CTC/Attention introduces\nhierarchical encoding and integrates CTC with the attention mechanism during\ndecoding, effectively managing both monotonic and non-monotonic alignments.\nMeanwhile, transfer learning helps bridge the modality gap between vision and\nlanguage in SLT. Experimental results on two widely adopted benchmarks,\nRWTH-PHOENIX-Weather 2014 T and CSL-Daily, show that our method achieves\nresults comparable to state-of-the-art and outperforms the pure-attention\nbaseline. Additionally, this work opens a new door for future research into\ngloss-free SLT using text-based CTC alignment.\n","authors":["Sihan Tan","Taro Miyazaki","Nabeela Khan","Kazuhiro Nakadai"],"pdf_url":"https://arxiv.org/pdf/2412.09014v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.18163v1","updated":"2024-12-24T04:51:32Z","published":"2024-12-24T04:51:32Z","title":"Survey of Pseudonymization, Abstractive Summarization & Spell Checker\n for Hindi and Marathi","summary":" India's vast linguistic diversity presents unique challenges and\nopportunities for technological advancement, especially in the realm of Natural\nLanguage Processing (NLP). While there has been significant progress in NLP\napplications for widely spoken languages, the regional languages of India, such\nas Marathi and Hindi, remain underserved. Research in the field of NLP for\nIndian regional languages is at a formative stage and holds immense\nsignificance. The paper aims to build a platform which enables the user to use\nvarious features like text anonymization, abstractive text summarization and\nspell checking in English, Hindi and Marathi language. 
The aim of these tools\nis to serve enterprise and consumer clients who predominantly use Indian\nRegional Languages.\n","authors":["Rasika Ransing","Mohammed Amaan Dhamaskar","Ayush Rajpurohit","Amey Dhoke","Sanket Dalvi"],"pdf_url":"https://arxiv.org/pdf/2412.18163v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.15268v2","updated":"2024-12-24T04:38:57Z","published":"2024-12-17T06:28:28Z","title":"Enhancing LLM-based Hatred and Toxicity Detection with Meta-Toxic\n Knowledge Graph","summary":" The rapid growth of social media platforms has raised significant concerns\nregarding online content toxicity. When Large Language Models (LLMs) are used\nfor toxicity detection, two key challenges emerge: 1) the absence of\ndomain-specific toxic knowledge leads to false negatives; 2) the excessive\nsensitivity of LLMs to toxic speech results in false positives, limiting\nfreedom of speech. To address these issues, we propose a novel method called\nMetaTox, leveraging graph search on a meta-toxic knowledge graph to enhance\nhatred and toxicity detection. First, we construct a comprehensive meta-toxic\nknowledge graph by utilizing LLMs to extract toxic information through a\nthree-step pipeline, with toxic benchmark datasets serving as corpora. Second,\nwe query the graph via retrieval and ranking processes to supplement accurate,\nrelevant toxic knowledge. 
Extensive experiments and in-depth case studies\nacross multiple datasets demonstrate that our MetaTox significantly decreases\nthe false positive rate while boosting overall toxicity detection performance.\nOur code will be available soon.\n","authors":["Yibo Zhao","Jiapeng Zhu","Can Xu","Xiang Li"],"pdf_url":"https://arxiv.org/pdf/2412.15268v2.pdf","comment":"8 pages of content"},{"id":"http://arxiv.org/abs/2412.18156v1","updated":"2024-12-24T04:28:42Z","published":"2024-12-24T04:28:42Z","title":"scReader: Prompting Large Language Models to Interpret scRNA-seq Data","summary":" Large language models (LLMs) have demonstrated remarkable advancements,\nprimarily due to their capabilities in modeling the hidden relationships within\ntext sequences. This innovation presents a unique opportunity in the field of\nlife sciences, where vast collections of single-cell omics data from multiple\nspecies provide a foundation for training foundational models. However, the\nchallenge lies in the disparity of data scales across different species,\nhindering the development of a comprehensive model for interpreting genetic\ndata across diverse organisms. In this study, we propose an innovative hybrid\napproach that integrates the general knowledge capabilities of LLMs with\ndomain-specific representation models for single-cell omics data\ninterpretation. We begin by focusing on genes as the fundamental unit of\nrepresentation. Gene representations are initialized using functional\ndescriptions, leveraging the strengths of mature language models such as\nLLaMA-2. By inputting single-cell gene-level expression data with prompts, we\neffectively model cellular representations based on the differential expression\nlevels of genes across various species and cell types. In the experiments, we\nconstructed developmental cells from humans and mice, specifically targeting\ncells that are challenging to annotate. 
We evaluated our methodology through\nbasic tasks such as cell annotation and visualization analysis. The results\ndemonstrate the efficacy of our approach compared to other methods using LLMs,\nhighlighting significant improvements in accuracy and interoperability. Our\nhybrid approach enhances the representation of single-cell data and offers a\nrobust framework for future research in cross-species genetic analysis.\n","authors":["Cong Li","Qingqing Long","Yuanchun Zhou","Meng Xiao"],"pdf_url":"https://arxiv.org/pdf/2412.18156v1.pdf","comment":"8 pages, Accepted by ICDM 2024"},{"id":"http://arxiv.org/abs/2412.18154v1","updated":"2024-12-24T04:20:43Z","published":"2024-12-24T04:20:43Z","title":"GeneSUM: Large Language Model-based Gene Summary Extraction","summary":" Emerging topics in biomedical research are continuously expanding, providing\na wealth of information about genes and their function. This rapid\nproliferation of knowledge presents unprecedented opportunities for scientific\ndiscovery and formidable challenges for researchers striving to keep abreast of\nthe latest advancements. One significant challenge is navigating the vast\ncorpus of literature to extract vital gene-related information, a\ntime-consuming and cumbersome task. To enhance the efficiency of this process,\nit is crucial to address several key challenges: (1) the overwhelming volume of\nliterature, (2) the complexity of gene functions, and (3) the automated\nintegration and generation. In response, we propose GeneSUM, a two-stage\nautomated gene summary extractor utilizing a large language model (LLM). Our\napproach retrieves and eliminates redundancy of target gene literature and then\nfine-tunes the LLM to refine and streamline the summarization process. We\nconducted extensive experiments to validate the efficacy of our proposed\nframework. 
The results demonstrate that LLM significantly enhances the\nintegration of gene-specific information, allowing more efficient\ndecision-making in ongoing research.\n","authors":["Zhijian Chen","Chuan Hu","Min Wu","Qingqing Long","Xuezhi Wang","Yuanchun Zhou","Meng Xiao"],"pdf_url":"https://arxiv.org/pdf/2412.18154v1.pdf","comment":"7 pages, Accepted by BIBM 2024"},{"id":"http://arxiv.org/abs/2412.16642v2","updated":"2024-12-24T04:20:18Z","published":"2024-12-21T14:24:32Z","title":"L3TC: Leveraging RWKV for Learned Lossless Low-Complexity Text\n Compression","summary":" Learning-based probabilistic models can be combined with an entropy coder for\ndata compression. However, due to the high complexity of learning-based models,\ntheir practical application as text compressors has been largely overlooked. To\naddress this issue, our work focuses on a low-complexity design while\nmaintaining compression performance. We introduce a novel Learned Lossless\nLow-complexity Text Compression method (L3TC). Specifically, we conduct\nextensive experiments demonstrating that RWKV models achieve the fastest\ndecoding speed with a moderate compression ratio, making it the most suitable\nbackbone for our method. Second, we propose an outlier-aware tokenizer that\nuses a limited vocabulary to cover frequent tokens while allowing outliers to\nbypass the prediction and encoding. Third, we propose a novel high-rank\nreparameterization strategy that enhances the learning capability during\ntraining without increasing complexity during inference. Experimental results\nvalidate that our method achieves 48% bit saving compared to gzip compressor.\nBesides, L3TC offers compression performance comparable to other learned\ncompressors, with a 50x reduction in model parameters. More importantly, L3TC\nis the fastest among all learned compressors, providing real-time decoding\nspeeds up to megabytes per second. 
Our code is available at\nhttps://github.com/alipay/L3TC-leveraging-rwkv-for-learned-lossless-low-complexity-text-compression.git.\n","authors":["Junxuan Zhang","Zhengxue Cheng","Yan Zhao","Shihao Wang","Dajiang Zhou","Guo Lu","Li Song"],"pdf_url":"https://arxiv.org/pdf/2412.16642v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.18151v1","updated":"2024-12-24T04:09:33Z","published":"2024-12-24T04:09:33Z","title":"CoAM: Corpus of All-Type Multiword Expressions","summary":" Multiword expressions (MWEs) refer to idiomatic sequences of multiple words.\nMWE identification, i.e., detecting MWEs in text, can play a key role in\ndownstream tasks such as machine translation. Existing datasets for MWE\nidentification are inconsistently annotated, limited to a single type of MWE,\nor limited in size. To enable reliable and comprehensive evaluation, we created\nCoAM: Corpus of All-Type Multiword Expressions, a dataset of 1.3K sentences\nconstructed through a multi-step process to enhance data quality consisting of\nhuman annotation, human review, and automated consistency checking. MWEs in\nCoAM are tagged with MWE types, such as Noun and Verb, to enable fine-grained\nerror analysis. Annotations for CoAM were collected using a new interface\ncreated with our interface generator, which allows easy and flexible annotation\nof MWEs in any form, including discontinuous ones. Through experiments using\nCoAM, we find that a fine-tuned large language model outperforms the current\nstate-of-the-art approach for MWE identification. 
Furthermore, analysis using\nour MWE type tagged data reveals that Verb MWEs are easier than Noun MWEs to\nidentify across approaches.\n","authors":["Yusuke Ide","Joshua Tanner","Adam Nohejl","Jacob Hoffman","Justin Vasselli","Hidetaka Kamigaito","Taro Watanabe"],"pdf_url":"https://arxiv.org/pdf/2412.18151v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.16507v2","updated":"2024-12-24T04:08:22Z","published":"2024-12-21T07:06:44Z","title":"Adapting Whisper for Code-Switching through Encoding Refining and\n Language-Aware Decoding","summary":" Code-switching (CS) automatic speech recognition (ASR) faces challenges due\nto the language confusion resulting from accents, auditory similarity, and\nseamless language switches. Adaptation on the pre-trained multi-lingual model\nhas shown promising performance for CS-ASR. In this paper, we adapt Whisper,\nwhich is a large-scale multilingual pre-trained speech recognition model, to CS\nfrom both encoder and decoder parts. First, we propose an encoder refiner to\nenhance the encoder's capacity of intra-sentence switching. Second, we propose\nusing two sets of language-aware adapters with different language prompt\nembeddings to achieve language-specific decoding information in each decoder\nlayer. Then, a fusion module is added to fuse the language-aware decoding. The\nexperimental results using the SEAME dataset show that, compared with the\nbaseline model, the proposed approach achieves a relative MER reduction of 4.1%\nand 7.2% on the dev_man and dev_sge test sets, respectively, surpassing\nstate-of-the-art methods. 
Through experiments, we found that the proposed\nmethod significantly improves the performance on non-native language in CS\nspeech, indicating that our approach enables Whisper to better distinguish\nbetween the two languages.\n","authors":["Jiahui Zhao","Hao Shi","Chenrui Cui","Tianrui Wang","Hexin Liu","Zhaoheng Ni","Lingxuan Ye","Longbiao Wang"],"pdf_url":"https://arxiv.org/pdf/2412.16507v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.18148v1","updated":"2024-12-24T04:04:54Z","published":"2024-12-24T04:04:54Z","title":"Are We in the AI-Generated Text World Already? Quantifying and\n Monitoring AIGT on Social Media","summary":" Social media platforms are experiencing a growing presence of AI-Generated\nTexts (AIGTs). However, the misuse of AIGTs could have profound implications\nfor public opinion, such as spreading misinformation and manipulating\nnarratives. Despite its importance, a systematic study to assess the prevalence\nof AIGTs on social media is still lacking. To address this gap, this paper aims\nto quantify, monitor, and analyze the AIGTs on online social media platforms.\nWe first collect a dataset (SM-D) with around 2.4M posts from 3 major social\nmedia platforms: Medium, Quora, and Reddit. Then, we construct a diverse\ndataset (AIGTBench) to train and evaluate AIGT detectors. AIGTBench combines\npopular open-source datasets and our AIGT datasets generated from social media\ntexts by 12 LLMs, serving as a benchmark for evaluating mainstream detectors.\nWith this setup, we identify the best-performing detector (OSM-Det). We then\napply OSM-Det to SM-D to track AIGTs over time and observe different trends of\nAI Attribution Rate (AAR) across social media platforms from January 2022 to\nOctober 2024. Specifically, Medium and Quora exhibit marked increases in AAR,\nrising from 1.77% to 37.03% and 2.06% to 38.95%, respectively. In contrast,\nReddit shows slower growth, with AAR increasing from 1.31% to 2.45% over the\nsame period. 
Our further analysis indicates that AIGTs differ from\nhuman-written texts across several dimensions, including linguistic patterns,\ntopic distributions, engagement levels, and the follower distribution of\nauthors. We envision our analysis and findings on AIGTs in social media can\nshed light on future research in this domain.\n","authors":["Zhen Sun","Zongmin Zhang","Xinyue Shen","Ziyi Zhang","Yule Liu","Michael Backes","Yang Zhang","Xinlei He"],"pdf_url":"https://arxiv.org/pdf/2412.18148v1.pdf","comment":"24 pages,18 figures"},{"id":"http://arxiv.org/abs/2410.07985v3","updated":"2024-12-24T04:04:30Z","published":"2024-10-10T14:39:33Z","title":"Omni-MATH: A Universal Olympiad Level Mathematic Benchmark For Large\n Language Models","summary":" Recent advancements in large language models (LLMs) have led to significant\nbreakthroughs in mathematical reasoning capabilities. However, existing\nbenchmarks like GSM8K or MATH are now being solved with high accuracy (e.g.,\nOpenAI o1 achieves 94.8\\% on MATH dataset), indicating their inadequacy for\ntruly challenging these models. To bridge this gap, we propose a comprehensive\nand challenging benchmark specifically designed to assess LLMs' mathematical\nreasoning at the Olympiad level. Unlike existing Olympiad-related benchmarks,\nour dataset focuses exclusively on mathematics and comprises a vast collection\nof 4428 competition-level problems with rigorous human annotation. These\nproblems are meticulously categorized into over 33 sub-domains and span more\nthan 10 distinct difficulty levels, enabling a holistic assessment of model\nperformance in Olympiad-mathematical reasoning. Furthermore, we conducted an\nin-depth analysis based on this benchmark. 
Our experimental results show that\neven the most advanced models, OpenAI o1-mini and OpenAI o1-preview, struggle\nwith highly challenging Olympiad-level problems, with 60.54\\% and 52.55\\%\naccuracy, highlighting significant challenges in Olympiad-level mathematical\nreasoning.\n","authors":["Bofei Gao","Feifan Song","Zhe Yang","Zefan Cai","Yibo Miao","Qingxiu Dong","Lei Li","Chenghao Ma","Liang Chen","Runxin Xu","Zhengyang Tang","Benyou Wang","Daoguang Zan","Shanghaoran Quan","Ge Zhang","Lei Sha","Yichang Zhang","Xuancheng Ren","Tianyu Liu","Baobao Chang"],"pdf_url":"https://arxiv.org/pdf/2410.07985v3.pdf","comment":"30 pages"},{"id":"http://arxiv.org/abs/2401.06824v4","updated":"2024-12-24T04:03:45Z","published":"2024-01-12T00:50:04Z","title":"Revisiting Jailbreaking for Large Language Models: A Representation\n Engineering Perspective","summary":" The recent surge in jailbreaking attacks has revealed significant\nvulnerabilities in Large Language Models (LLMs) when exposed to malicious\ninputs. While various defense strategies have been proposed to mitigate these\nthreats, there has been limited research into the underlying mechanisms that\nmake LLMs vulnerable to such attacks. In this study, we suggest that the\nself-safeguarding capability of LLMs is linked to specific activity patterns\nwithin their representation space. Although these patterns have little impact\non the semantic content of the generated text, they play a crucial role in\nshaping LLM behavior under jailbreaking attacks. Our findings demonstrate that\nthese patterns can be detected with just a few pairs of contrastive queries.\nExtensive experimentation shows that the robustness of LLMs against\njailbreaking can be manipulated by weakening or strengthening these patterns.\nFurther visual analysis provides additional evidence for our conclusions,\nproviding new insights into the jailbreaking phenomenon. 
These findings\nhighlight the importance of addressing the potential misuse of open-source LLMs\nwithin the community.\n","authors":["Tianlong Li","Zhenghua Wang","Wenhao Liu","Muling Wu","Shihan Dou","Changze Lv","Xiaohua Wang","Xiaoqing Zheng","Xuanjing Huang"],"pdf_url":"https://arxiv.org/pdf/2401.06824v4.pdf","comment":"Accepted by COLING 2025"},{"id":"http://arxiv.org/abs/2412.18139v1","updated":"2024-12-24T03:50:03Z","published":"2024-12-24T03:50:03Z","title":"Ensuring Consistency for In-Image Translation","summary":" The in-image machine translation task involves translating text embedded\nwithin images, with the translated results presented in image format. While\nthis task has numerous applications in various scenarios such as film poster\ntranslation and everyday scene image translation, existing methods frequently\nneglect the aspect of consistency throughout this process. We propose the need\nto uphold two types of consistency in this task: translation consistency and\nimage generation consistency. The former entails incorporating image\ninformation during translation, while the latter involves maintaining\nconsistency between the style of the text-image and the original image,\nensuring background integrity. To address these consistency requirements, we\nintroduce a novel two-stage framework named HCIIT (High-Consistency In-Image\nTranslation) which involves text-image translation using a multimodal\nmultilingual large language model in the first stage and image backfilling with\na diffusion model in the second stage. Chain of thought learning is utilized in\nthe first stage to enhance the model's ability to leverage image information\nduring translation. Subsequently, a diffusion model trained for\nstyle-consistent text-image generation ensures uniformity in text style within\nimages and preserves background details. A dataset comprising 400,000\nstyle-consistent pseudo text-image pairs is curated for model training. 
Results\nobtained on both curated test sets and authentic image test sets validate the\neffectiveness of our framework in ensuring consistency and producing\nhigh-quality translated images.\n","authors":["Chengpeng Fu","Xiaocheng Feng","Yichong Huang","Wenshuai Huo","Baohang Li","Zhirui Zhang","Yunfei Lu","Dandan Tu","Duyu Tang","Hui Wang","Bing Qin","Ting Liu"],"pdf_url":"https://arxiv.org/pdf/2412.18139v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.18135v1","updated":"2024-12-24T03:43:15Z","published":"2024-12-24T03:43:15Z","title":"LSAQ: Layer-Specific Adaptive Quantization for Large Language Model\n Deployment","summary":" As large language models (LLMs) demonstrate exceptional performance across\nvarious domains, the deployment of these models on edge devices has emerged as\na new trend. Quantization techniques, which reduce the size and memory\nfootprint of LLMs, are effective for enabling deployment on\nresource-constrained edge devices. However, existing one-size-fits-all\nquantization methods often fail to dynamically adjust the memory consumption of\nLLMs based on specific hardware characteristics and usage scenarios. To address\nthis limitation, we propose LSAQ (Layer-Specific Adaptive Quantization), a\nsystem for adaptive quantization and dynamic deployment of LLMs based on layer\nimportance. LSAQ evaluates layer importance by constructing top-k token sets\nfrom the inputs and outputs of each layer and calculating their Jaccard\ncoefficient. Using this evaluation, the system adaptively adjusts quantization\nstrategies in real time according to the resource availability of edge devices,\nassigning different precision levels to layers of varying importance. 
This\napproach significantly reduces the storage requirements of LLMs while\nmaintaining model performance, enabling efficient deployment across diverse\nhardware platforms and usage scenarios.\n","authors":["Binrui Zeng","Bin Ji","Xiaodong Liu","Jie Yu","Shasha Li","Jun Ma","Xiaopeng Li","Shangwen Wang","Xinran Hong"],"pdf_url":"https://arxiv.org/pdf/2412.18135v1.pdf","comment":"8 pages, 4 figures, work in progress"},{"id":"http://arxiv.org/abs/2412.17701v2","updated":"2024-12-24T03:23:24Z","published":"2024-12-23T16:32:55Z","title":"From Models to Microtheories: Distilling a Model's Topical Knowledge for\n Grounded Question Answering","summary":" Recent reasoning methods (e.g., chain-of-thought, entailment reasoning) help\nusers understand how language models (LMs) answer a single question, but they\ndo little to reveal the LM's overall understanding, or \"theory,\" about the\nquestion's topic, making it still hard to trust the model. Our goal is to\nmaterialize such theories - here called microtheories (a linguistic analog of\nlogical microtheories) - as a set of sentences encapsulating an LM's core\nknowledge about a topic. These statements systematically work together to\nentail answers to a set of questions to both engender trust and improve\nperformance. Our approach is to first populate a knowledge store with\n(model-generated) sentences that entail answers to training questions and then\ndistill those down to a core microtheory that is concise, general, and\nnon-redundant. We show that, when added to a general corpus (e.g., Wikipedia),\nmicrotheories can supply critical, topical information not necessarily present\nin the corpus, improving both a model's ability to ground its answers to\nverifiable knowledge (i.e., show how answers are systematically entailed by\ndocuments in the corpus, fully grounding up to +8% more answers), and the\naccuracy of those grounded answers (up to +8% absolute). 
We also show that, in\na human evaluation in the medical domain, our distilled microtheories contain a\nsignificantly higher concentration of topically critical facts than the\nnon-distilled knowledge store. Finally, we show we can quantify the coverage of\na microtheory for a topic (characterized by a dataset) using a notion of\n$p$-relevance. Together, these suggest that microtheories are an efficient\ndistillation of an LM's topic-relevant knowledge, that they can usefully\naugment existing corpora, and can provide both performance gains and an\ninterpretable, verifiable window into the model's knowledge of a topic.\n","authors":["Nathaniel Weir","Bhavana Dalvi Mishra","Orion Weller","Oyvind Tafjord","Sam Hornstein","Alexander Sabol","Peter Jansen","Benjamin Van Durme","Peter Clark"],"pdf_url":"https://arxiv.org/pdf/2412.17701v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.18123v1","updated":"2024-12-24T03:17:45Z","published":"2024-12-24T03:17:45Z","title":"AEIOU: A Unified Defense Framework against NSFW Prompts in Text-to-Image\n Models","summary":" As text-to-image (T2I) models continue to advance and gain widespread\nadoption, their associated safety issues are becoming increasingly prominent.\nMalicious users often exploit these models to generate Not-Safe-for-Work (NSFW)\nimages using harmful or adversarial prompts, highlighting the critical need for\nrobust safeguards to ensure the integrity and compliance of model outputs.\nCurrent internal safeguards frequently degrade image quality, while external\ndetection methods often suffer from low accuracy and inefficiency.\n In this paper, we introduce AEIOU, a defense framework that is Adaptable,\nEfficient, Interpretable, Optimizable, and Unified against NSFW prompts in T2I\nmodels. AEIOU extracts NSFW features from the hidden states of the model's text\nencoder, utilizing the separable nature of these features to detect NSFW\nprompts. 
The detection process is efficient, requiring minimal inference time.\nAEIOU also offers real-time interpretation of results and supports optimization\nthrough data augmentation techniques. The framework is versatile, accommodating\nvarious T2I architectures. Our extensive experiments show that AEIOU\nsignificantly outperforms both commercial and open-source moderation tools,\nachieving over 95% accuracy across all datasets and improving efficiency by at\nleast tenfold. It effectively counters adaptive attacks and excels in few-shot\nand multi-label scenarios.\n","authors":["Yiming Wang","Jiahao Chen","Qingming Li","Xing Yang","Shouling Ji"],"pdf_url":"https://arxiv.org/pdf/2412.18123v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.18120v1","updated":"2024-12-24T03:06:52Z","published":"2024-12-24T03:06:52Z","title":"Do Language Models Understand the Cognitive Tasks Given to Them?\n Investigations with the N-Back Paradigm","summary":" Cognitive tasks originally developed for humans are now increasingly used to\nstudy language models. While applying these tasks is often straightforward,\ninterpreting their results can be challenging. In particular, when a model\nunderperforms, it's often unclear whether this results from a limitation in the\ncognitive ability being tested or a failure to understand the task itself. A\nrecent study argued that GPT 3.5's declining performance on 2-back and 3-back\ntasks reflects a working memory capacity limit similar to humans. By analyzing\na range of open-source language models of varying performance levels on these\ntasks, we show that the poor performance instead reflects a limitation in task\ncomprehension and task set maintenance. In addition, we push the best\nperforming model to higher n values and experiment with alternative prompting\nstrategies, before analyzing model attentions. 
Our larger aim is to contribute\nto the ongoing conversation around refining methodologies for the cognitive\nevaluation of language models.\n","authors":["Xiaoyang Hu","Richard L. Lewis"],"pdf_url":"https://arxiv.org/pdf/2412.18120v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.16844v2","updated":"2024-12-24T03:02:32Z","published":"2024-12-22T03:43:51Z","title":"Sim911: Towards Effective and Equitable 9-1-1 Dispatcher Training with\n an LLM-Enabled Simulation","summary":" Emergency response services are vital for enhancing public safety by\nsafeguarding the environment, property, and human lives. As frontline members\nof these services, 9-1-1 dispatchers have a direct impact on response times and\nthe overall effectiveness of emergency operations. However, traditional\ndispatcher training methods, which rely on role-playing by experienced\npersonnel, are labor-intensive, time-consuming, and often neglect the specific\nneeds of underserved communities. To address these challenges, we introduce\nSim911, the first training simulation for 9-1-1 dispatchers powered by Large\nLanguage Models (LLMs). 
Sim911 enhances training through three key technical\ninnovations: (1) knowledge construction, which utilizes archived 9-1-1 call\ndata to generate simulations that closely mirror real-world scenarios; (2)\ncontext-aware controlled generation, which employs dynamic prompts and vector\nbases to ensure that LLM behavior aligns with training objectives; and (3)\nvalidation with looped correction, which filters out low-quality responses and\nrefines the system performance.\n","authors":["Zirong Chen","Elizabeth Chason","Noah Mladenovski","Erin Wilson","Kristin Mullen","Stephen Martini","Meiyi Ma"],"pdf_url":"https://arxiv.org/pdf/2412.16844v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.02371v2","updated":"2024-12-24T02:20:25Z","published":"2024-12-03T10:57:19Z","title":"TSCheater: Generating High-Quality Tibetan Adversarial Texts via Visual\n Similarity","summary":" Language models based on deep neural networks are vulnerable to textual\nadversarial attacks. While rich-resource languages like English are receiving\nfocused attention, Tibetan, a cross-border language, is gradually being studied\ndue to its abundant ancient literature and critical language strategy.\nCurrently, there are several Tibetan adversarial text generation methods, but\nthey do not fully consider the textual features of Tibetan script and\noverestimate the quality of generated adversarial texts. To address this issue,\nwe propose a novel Tibetan adversarial text generation method called TSCheater,\nwhich considers the characteristic of Tibetan encoding and the feature that\nvisually similar syllables have similar semantics. This method can also be\ntransferred to other abugidas, such as Devanagari script. We utilize a\nself-constructed Tibetan syllable visual similarity database called TSVSDB to\ngenerate substitution candidates and adopt a greedy algorithm-based scoring\nmechanism to determine substitution order. After that, we conduct the method on\neight victim language models. 
Experimentally, TSCheater outperforms existing\nmethods in attack effectiveness, perturbation magnitude, semantic similarity,\nvisual similarity, and human acceptance. Finally, we construct the first\nTibetan adversarial robustness evaluation benchmark called AdvTS, which is\ngenerated by existing methods and proofread by humans.\n","authors":["Xi Cao","Quzong Gesang","Yuan Sun","Nuo Qun","Tashi Nyima"],"pdf_url":"https://arxiv.org/pdf/2412.02371v2.pdf","comment":"Pre-Camera-Ready Version; Accepted at ICASSP 2025"},{"id":"http://arxiv.org/abs/2412.07573v2","updated":"2024-12-24T02:09:03Z","published":"2024-12-10T15:06:48Z","title":"Subtopic-aware View Sampling and Temporal Aggregation for Long-form\n Document Matching","summary":" Long-form document matching aims to judge the relevance between two documents\nand has been applied to various scenarios. Most existing works utilize\nhierarchical or long context models to process documents, which achieve coarse\nunderstanding but may ignore details. Some researchers construct a document\nview with similar sentences about aligned document subtopics to focus on\ndetailed matching signals. However, a long document generally contains multiple\nsubtopics. The matching signals are heterogeneous from multiple topics.\nConsidering only the homologous aligned subtopics may not be representative\nenough and may cause biased modeling. In this paper, we introduce a new\nframework to model representative matching signals. First, we propose to\ncapture various matching signals through subtopics of document pairs. Next, we\nconstruct multiple document views based on subtopics to cover heterogeneous and\nvaluable details. However, existing spatial aggregation methods like attention,\nwhich integrate all these views simultaneously, are hard to integrate\nheterogeneous information. 
Instead, we propose temporal aggregation, which\neffectively integrates different views gradually as the training progresses.\nExperimental results show that our learning framework is effective on several\ndocument-matching tasks, including news duplication and legal case retrieval.\n","authors":["Youchao Zhou","Heyan Huang","Zhijing Wu","Yuhang Liu","Xinglin Wang"],"pdf_url":"https://arxiv.org/pdf/2412.07573v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.18093v1","updated":"2024-12-24T02:08:38Z","published":"2024-12-24T02:08:38Z","title":"Molly: Making Large Language Model Agents Solve Python Problem More\n Logically","summary":" Applying large language models (LLMs) as teaching assistants has attracted much\nattention as an integral part of intelligent education, particularly in\ncomputing courses. To reduce the gap between the LLMs and the computer\nprogramming education expert, fine-tuning and retrieval augmented generation\n(RAG) are the two mainstream methods in existing research. However,\nfine-tuning for specific tasks is resource-intensive and may diminish the\nmodel's generalization capabilities. RAG can perform well on reducing the\nillusion of LLMs, but the generation of irrelevant factual content during\nreasoning can cause significant confusion for learners. To address these\nproblems, we introduce the Molly agent, focusing on solving the proposed\nproblem encountered by learners when learning Python programming language. Our\nagent automatically parses the learners' questioning intent through a\nscenario-based interaction, enabling precise retrieval of relevant documents\nfrom the constructed knowledge base. At generation stage, the agent reflects on\nthe generated responses to ensure that they not only align with factual content\nbut also effectively answer the user's queries. 
Extensive experimentation on a\nconstructed Chinese Python QA dataset shows the effectiveness of the Molly\nagent, indicating an enhancement in its performance for providing useful\nresponses to Python questions.\n","authors":["Rui Xiao","Jiong Wang","Lu Han","Na Zong","Han Wu"],"pdf_url":"https://arxiv.org/pdf/2412.18093v1.pdf","comment":"arXiv admin note: text overlap with arXiv:2402.07913"},{"id":"http://arxiv.org/abs/2410.10179v2","updated":"2024-12-24T01:56:40Z","published":"2024-10-14T05:54:11Z","title":"Is Parameter Collision Hindering Continual Learning in LLMs?","summary":" Large Language Models (LLMs) often suffer from catastrophic forgetting when\nlearning multiple tasks sequentially, making continual learning (CL) essential\nfor their dynamic deployment. Existing state-of-the-art (SOTA) methods, such as\nO-LoRA, typically focus on constructing orthogonality tasks to decouple\nparameter interdependence from various domains. In this paper, we reveal that\nbuilding non-collision parameters is a more critical factor in addressing CL\nchallenges. Our theoretical and experimental analyses demonstrate that\nnon-collision parameters can provide better task orthogonality, which is a\nsufficient but unnecessary condition. Furthermore, knowledge from multiple\ndomains will be preserved in non-collision parameter subspaces, making it more\ndifficult to forget previously seen data. Leveraging this insight, we propose\nNon-collision Low-Rank Adaptation (N-LoRA), a simple yet effective approach\nleveraging low collision rates to enhance CL in LLMs. 
Experimental results on\nmultiple CL benchmarks indicate that N-LoRA achieves superior performance\n(+2.9), higher task orthogonality (*4.1 times), and lower parameter collision\n(*58.1 times) than SOTA methods.\n","authors":["Shuo Yang","Kun-Peng Ning","Yu-Yang Liu","Jia-Yu Yao","Yong-Hong Tian","Yi-Bing Song","Li Yuan"],"pdf_url":"https://arxiv.org/pdf/2410.10179v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.18086v1","updated":"2024-12-24T01:52:19Z","published":"2024-12-24T01:52:19Z","title":"Generating Traffic Scenarios via In-Context Learning to Learn Better\n Motion Planner","summary":" Motion planning is a crucial component in autonomous driving.\nState-of-the-art motion planners are trained on meticulously curated datasets,\nwhich are not only expensive to annotate but also insufficient in capturing\nrarely seen critical scenarios. Failing to account for such scenarios poses a\nsignificant risk to motion planners and may lead to incidents during testing.\nAn intuitive solution is to manually compose such scenarios by programming and\nexecuting a simulator (e.g., CARLA). However, this approach incurs substantial\nhuman costs. Motivated by this, we propose an inexpensive method for generating\ndiverse critical traffic scenarios to train more robust motion planners. First,\nwe represent traffic scenarios as scripts, which are then used by the simulator\nto generate traffic scenarios. Next, we develop a method that accepts\nuser-specified text descriptions, which a Large Language Model (LLM) translates\ninto scripts using in-context learning. The output scripts are sent to the\nsimulator that produces the corresponding traffic scenarios. As our method can\ngenerate abundant safety-critical traffic scenarios, we use them as synthetic\ntraining data for motion planners. To demonstrate the value of generated\nscenarios, we train existing motion planners on our synthetic data, real-world\ndatasets, and a combination of both. 
Our experiments show that motion planners\ntrained with our data significantly outperform those trained solely on\nreal-world data, showing the usefulness of our synthetic data and the\neffectiveness of our data generation method. Our source code is available at\nhttps://ezharjan.github.io/AutoSceneGen.\n","authors":["Aizierjiang Aiersilan"],"pdf_url":"https://arxiv.org/pdf/2412.18086v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.11711v2","updated":"2024-12-24T01:43:42Z","published":"2024-12-16T12:33:12Z","title":"MiMoTable: A Multi-scale Spreadsheet Benchmark with Meta Operations for\n Table Reasoning","summary":" Extensive research has been conducted to explore the capability of Large\nLanguage Models (LLMs) for table reasoning and has significantly improved the\nperformance on existing benchmarks. However, tables and user questions in\nreal-world applications are more complex and diverse, presenting an unignorable\ngap compared to the existing benchmarks. To fill the gap, we propose a\n\\textbf{M}ult\\textbf{i}-scale spreadsheet benchmark with \\textbf{M}eta\n\\textbf{o}perations for \\textbf{Table} reasoning, named as MiMoTable.\nSpecifically, MiMoTable incorporates two key features. First, the tables in\nMiMoTable are all spreadsheets used in real-world scenarios, which cover seven\ndomains and contain different types. Second, we define a new criterion with six\ncategories of meta operations for measuring the difficulty of each question in\nMiMoTable, simultaneously as a new perspective for measuring the difficulty of\nthe existing benchmarks. Experimental results show that Claude-3.5-Sonnet\nachieves the best performance with 77.4\\% accuracy, indicating that there is\nstill significant room to improve for LLMs on MiMoTable. 
Furthermore, we grade\nthe difficulty of existing benchmarks according to our new criteria.\nExperiments have shown that the performance of LLMs decreases as the difficulty\nof benchmarks increases, thereby proving the effectiveness of our proposed new\ncriterion.\n","authors":["Zheng Li","Yang Du","Mao Zheng","Mingyang Song"],"pdf_url":"https://arxiv.org/pdf/2412.11711v2.pdf","comment":"Accepted by COLING 2025"},{"id":"http://arxiv.org/abs/2403.11802v5","updated":"2024-12-24T01:41:28Z","published":"2024-03-18T14:01:45Z","title":"Counting-Stars: A Multi-evidence, Position-aware, and Scalable Benchmark\n for Evaluating Long-Context Large Language Models","summary":" Despite recent efforts to develop large language models with robust\nlong-context capabilities, the lack of long-context benchmarks means that\nrelatively little is known about their performance. To alleviate this gap, in\nthis paper, we propose \\textbf{Counting-Stars}, a multi-evidence,\nposition-aware, and scalable benchmark designed to evaluate the multi-evidence\nretrieval capabilities of long-context LLMs. \\textbf{Counting-Stars} comprises\ntwo counting-based multiple pieces of evidence retrieval sub-tasks: searching\nand reasoning. Using Counting-Stars, we conduct experiments to evaluate several\nlong-context LLMs, including GPT-4 Turbo, Gemini 1.5 Pro, Claude3 Opus, GLM-4,\nand Moonshot-v1. Extensive experimental results demonstrate that Gemini 1.5 Pro\nachieves the best overall results, while GPT-4 Turbo exhibits the most stable\nperformance across various tasks. 
Furthermore, our analysis of these LLMs,\nwhich have been extended to handle long-context scenarios, indicates that\nsignificant room for improvement remains as the length of the input context and\nthe complexity of the tasks increase.\n","authors":["Mingyang Song","Mao Zheng","Xuan Luo"],"pdf_url":"https://arxiv.org/pdf/2403.11802v5.pdf","comment":"Accepted by COLING 2025"},{"id":"http://arxiv.org/abs/2410.03103v2","updated":"2024-12-24T01:37:23Z","published":"2024-10-04T02:53:52Z","title":"Horizon-Length Prediction: Advancing Fill-in-the-Middle Capabilities for\n Code Generation with Lookahead Planning","summary":" Fill-in-the-Middle (FIM) has become integral to code language models,\nenabling generation of missing code given both left and right contexts.\nHowever, the current FIM training paradigm, which reorders original training\nsequences and then performs regular next-token prediction (NTP), often leads to\nmodels struggling to generate content that aligns smoothly with the surrounding\ncontext. Crucially, while existing works rely on rule-based post-processing to\ncircumvent this weakness, such methods are not practically usable in\nopen-domain code completion tasks as they depend on restrictive,\ndataset-specific assumptions (e.g., generating the same number of lines as in\nthe ground truth). Moreover, model performance on FIM tasks deteriorates\nsignificantly without these unrealistic assumptions.\n We hypothesize that NTP alone is insufficient for models to learn effective\nplanning conditioned on the distant right context, a critical factor for\nsuccessful code infilling. To overcome this, we propose Horizon-Length\nPrediction (HLP), a novel training objective that teaches models to predict the\nnumber of remaining middle tokens (i.e., horizon length) at each step. 
HLP\nadvances FIM with lookahead planning, enabling models to inherently learn\ninfilling boundaries for arbitrary left and right contexts without relying on\ndataset-specific post-processing. Our evaluation across different models and\nsizes shows that HLP significantly improves FIM performance by up to 24%\nrelatively on diverse benchmarks, across file-level and repository-level, and\nwithout resorting to unrealistic post-processing methods. Furthermore, the\nenhanced planning capability gained through HLP boosts model performance on\ncode reasoning. Importantly, HLP only incurs negligible training overhead and\nno additional inference cost, ensuring its practicality for real-world\nscenarios.\n","authors":["Yifeng Ding","Hantian Ding","Shiqi Wang","Qing Sun","Varun Kumar","Zijian Wang"],"pdf_url":"https://arxiv.org/pdf/2410.03103v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.18072v1","updated":"2024-12-24T00:59:16Z","published":"2024-12-24T00:59:16Z","title":"MMFactory: A Universal Solution Search Engine for Vision-Language Tasks","summary":" With advances in foundational and vision-language models, and effective\nfine-tuning techniques, a large number of both general and special-purpose\nmodels have been developed for a variety of visual tasks. Despite the\nflexibility and accessibility of these models, no single model is able to\nhandle all tasks and/or applications that may be envisioned by potential users.\nRecent approaches, such as visual programming and multimodal LLMs with\nintegrated tools aim to tackle complex visual tasks, by way of program\nsynthesis. However, such approaches overlook user constraints (e.g.,\nperformance / computational needs), produce test-time sample-specific solutions\nthat are difficult to deploy, and, sometimes, require low-level instructions\nthat maybe beyond the abilities of a naive user. 
To address these limitations,\nwe introduce MMFactory, a universal framework that includes model and metrics\nrouting components, acting like a solution search engine across various\navailable models. Based on a task description and few sample input-output pairs\nand (optionally) resource and/or performance constraints, MMFactory can suggest\na diverse pool of programmatic solutions by instantiating and combining\nvisio-lingual tools from its model repository. In addition to synthesizing\nthese solutions, MMFactory also proposes metrics and benchmarks performance /\nresource characteristics, allowing users to pick a solution that meets their\nunique design constraints. From the technical perspective, we also introduced a\ncommittee-based solution proposer that leverages multi-agent LLM conversation\nto generate executable, diverse, universal, and robust solutions for the user.\nExperimental results show that MMFactory outperforms existing methods by\ndelivering state-of-the-art solutions tailored to user problem specifications.\nProject page is available at https://davidhalladay.github.io/mmfactory_demo.\n","authors":["Wan-Cyuan Fan","Tanzila Rahman","Leonid Sigal"],"pdf_url":"https://arxiv.org/pdf/2412.18072v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.18069v1","updated":"2024-12-24T00:55:59Z","published":"2024-12-24T00:55:59Z","title":"Improving Factuality with Explicit Working Memory","summary":" Large language models can generate factually inaccurate content, a problem\nknown as hallucination. Recent works have built upon retrieved-augmented\ngeneration to improve factuality through iterative prompting but these methods\nare limited by the traditional RAG design. To address these challenges, we\nintroduce EWE (Explicit Working Memory), a novel approach that enhances\nfactuality in long-form text generation by integrating a working memory that\nreceives real-time feedback from external resources. 
The memory is refreshed\nbased on online fact-checking and retrieval feedback, allowing EWE to rectify\nfalse claims during the generation process and ensure more accurate and\nreliable outputs. Our experiments demonstrate that Ewe outperforms strong\nbaselines on four fact-seeking long-form generation datasets, increasing the\nfactuality metric, VeriScore, by 2 to 10 points absolute without sacrificing\nthe helpfulness of the responses. Further analysis reveals that the design of\nrules for memory updates, configurations of memory units, and the quality of\nthe retrieval datastore are crucial factors for influencing model performance.\n","authors":["Mingda Chen","Yang Li","Karthik Padthe","Rulin Shao","Alicia Sun","Luke Zettlemoyer","Gargi Gosh","Wen-tau Yih"],"pdf_url":"https://arxiv.org/pdf/2412.18069v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.18061v1","updated":"2024-12-24T00:20:38Z","published":"2024-12-24T00:20:38Z","title":"Lla-VAP: LSTM Ensemble of Llama and VAP for Turn-Taking Prediction","summary":" Turn-taking prediction is the task of anticipating when the speaker in a\nconversation will yield their turn to another speaker to begin speaking. This\nproject expands on existing strategies for turn-taking prediction by employing\na multi-modal ensemble approach that integrates large language models (LLMs)\nand voice activity projection (VAP) models. By combining the linguistic\ncapabilities of LLMs with the temporal precision of VAP models, we aim to\nimprove the accuracy and efficiency of identifying TRPs in both scripted and\nunscripted conversational scenarios. 
Our methods are evaluated on the\nIn-Conversation Corpus (ICC) and Coached Conversational Preference Elicitation\n(CCPE) datasets, highlighting the strengths and limitations of current models\nwhile proposing a potentially more robust framework for enhanced prediction.\n","authors":["Hyunbae Jeon","Frederic Guintu","Rayvant Sahni"],"pdf_url":"https://arxiv.org/pdf/2412.18061v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.18053v1","updated":"2024-12-24T00:01:24Z","published":"2024-12-24T00:01:24Z","title":"Neuron Empirical Gradient: Connecting Neurons' Linear Controllability\n and Representational Capacity","summary":" Although neurons in the feed-forward layers of pre-trained language models\n(PLMs) can store factual knowledge, most prior analyses remain qualitative,\nleaving the quantitative relationship among knowledge representation, neuron\nactivations, and model output poorly understood. In this study, by performing\nneuron-wise interventions using factual probing datasets, we first reveal the\nlinear relationship between neuron activations and output token probabilities.\nWe refer to the gradient of this linear relationship as ``neuron empirical\ngradients.'' and propose NeurGrad, an efficient method for their calculation to\nfacilitate quantitative neuron analysis. We next investigate whether neuron\nempirical gradients in PLMs encode general task knowledge by probing skill\nneurons. To this end, we introduce MCEval8k, a multi-choice knowledge\nevaluation benchmark spanning six genres and 22 tasks. Our experiments confirm\nthat neuron empirical gradients effectively capture knowledge, while skill\nneurons exhibit efficiency, generality, inclusivity, and interdependency. These\nfindings link knowledge to PLM outputs via neuron empirical gradients, shedding\nlight on how PLMs store knowledge. 
The code and dataset are released.\n","authors":["Xin Zhao","Zehui Jiang","Naoki Yoshinaga"],"pdf_url":"https://arxiv.org/pdf/2412.18053v1.pdf","comment":"29 pages, 18 figures"}],"Computer Vision and Pattern Recognition":[{"id":"http://arxiv.org/abs/2412.18609v1","updated":"2024-12-24T18:59:56Z","published":"2024-12-24T18:59:56Z","title":"Video-Panda: Parameter-efficient Alignment for Encoder-free\n Video-Language Models","summary":" We present an efficient encoder-free approach for video-language\nunderstanding that achieves competitive performance while significantly\nreducing computational overhead. Current video-language models typically rely\non heavyweight image encoders (300M-1.1B parameters) or video encoders (1B-1.4B\nparameters), creating a substantial computational burden when processing\nmulti-frame videos. Our method introduces a novel Spatio-Temporal Alignment\nBlock (STAB) that directly processes video inputs without requiring pre-trained\nencoders while using only 45M parameters for visual processing - at least a\n6.5$\\times$ reduction compared to traditional approaches. The STAB architecture\ncombines Local Spatio-Temporal Encoding for fine-grained feature extraction,\nefficient spatial downsampling through learned attention and separate\nmechanisms for modeling frame-level and video-level relationships. Our model\nachieves comparable or superior performance to encoder-based approaches for\nopen-ended video question answering on standard benchmarks. The fine-grained\nvideo question-answering evaluation demonstrates our model's effectiveness,\noutperforming the encoder-based approaches Video-ChatGPT and Video-LLaVA in key\naspects like correctness and temporal understanding. Extensive ablation studies\nvalidate our architectural choices and demonstrate the effectiveness of our\nspatio-temporal modeling approach while achieving 3-4$\\times$ faster processing\nspeeds than previous methods. 
Code is available at\n\\url{https://github.com/jh-yi/Video-Panda}.\n","authors":["Jinhui Yi","Syed Talal Wasim","Yanan Luo","Muzammal Naseer","Juergen Gall"],"pdf_url":"https://arxiv.org/pdf/2412.18609v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.18608v1","updated":"2024-12-24T18:59:43Z","published":"2024-12-24T18:59:43Z","title":"PartGen: Part-level 3D Generation and Reconstruction with Multi-View\n Diffusion Models","summary":" Text- or image-to-3D generators and 3D scanners can now produce 3D assets\nwith high-quality shapes and textures. These assets typically consist of a\nsingle, fused representation, like an implicit neural field, a Gaussian\nmixture, or a mesh, without any useful structure. However, most applications\nand creative workflows require assets to be made of several meaningful parts\nthat can be manipulated independently. To address this gap, we introduce\nPartGen, a novel approach that generates 3D objects composed of meaningful\nparts starting from text, an image, or an unstructured 3D object. First, given\nmultiple views of a 3D object, generated or rendered, a multi-view diffusion\nmodel extracts a set of plausible and view-consistent part segmentations,\ndividing the object into parts. Then, a second multi-view diffusion model takes\neach part separately, fills in the occlusions, and uses those completed views\nfor 3D reconstruction by feeding them to a 3D reconstruction network. This\ncompletion process considers the context of the entire object to ensure that\nthe parts integrate cohesively. The generative completion model can make up for\nthe information missing due to occlusions; in extreme cases, it can hallucinate\nentirely invisible parts based on the input 3D asset. We evaluate our method on\ngenerated and real 3D assets and show that it outperforms segmentation and\npart-extraction baselines by a large margin. 
We also showcase downstream\napplications such as 3D part editing.\n","authors":["Minghao Chen","Roman Shapovalov","Iro Laina","Tom Monnier","Jianyuan Wang","David Novotny","Andrea Vedaldi"],"pdf_url":"https://arxiv.org/pdf/2412.18608v1.pdf","comment":"Project Page: https://silent-chen.github.io/PartGen/"},{"id":"http://arxiv.org/abs/2412.18607v1","updated":"2024-12-24T18:59:37Z","published":"2024-12-24T18:59:37Z","title":"DrivingGPT: Unifying Driving World Modeling and Planning with\n Multi-modal Autoregressive Transformers","summary":" World model-based searching and planning are widely recognized as a promising\npath toward human-level physical intelligence. However, current driving world\nmodels primarily rely on video diffusion models, which specialize in visual\ngeneration but lack the flexibility to incorporate other modalities like\naction. In contrast, autoregressive transformers have demonstrated exceptional\ncapability in modeling multimodal data. Our work aims to unify both driving\nmodel simulation and trajectory planning into a single sequence modeling\nproblem. We introduce a multimodal driving language based on interleaved image\nand action tokens, and develop DrivingGPT to learn joint world modeling and\nplanning through standard next-token prediction. Our DrivingGPT demonstrates\nstrong performance in both action-conditioned video generation and end-to-end\nplanning, outperforming strong baselines on large-scale nuPlan and NAVSIM\nbenchmarks.\n","authors":["Yuntao Chen","Yuqi Wang","Zhaoxiang Zhang"],"pdf_url":"https://arxiv.org/pdf/2412.18607v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.18605v1","updated":"2024-12-24T18:58:43Z","published":"2024-12-24T18:58:43Z","title":"Orient Anything: Learning Robust Object Orientation Estimation from\n Rendering 3D Models","summary":" Orientation is a key attribute of objects, crucial for understanding their\nspatial pose and arrangement in images. 
However, practical solutions for\naccurate orientation estimation from a single image remain underexplored. In\nthis work, we introduce Orient Anything, the first expert and foundational\nmodel designed to estimate object orientation in a single- and free-view image.\nDue to the scarcity of labeled data, we propose extracting knowledge from the\n3D world. By developing a pipeline to annotate the front face of 3D objects and\nrender images from random views, we collect 2M images with precise orientation\nannotations. To fully leverage the dataset, we design a robust training\nobjective that models the 3D orientation as probability distributions of three\nangles and predicts the object orientation by fitting these distributions.\nBesides, we employ several strategies to improve synthetic-to-real transfer.\nOur model achieves state-of-the-art orientation estimation accuracy in both\nrendered and real images and exhibits impressive zero-shot ability in various\nscenarios. More importantly, our model enhances many applications, such as\ncomprehension and generation of complex spatial concepts and 3D object pose\nadjustment.\n","authors":["Zehan Wang","Ziang Zhang","Tianyu Pang","Chao Du","Hengshuang Zhao","Zhou Zhao"],"pdf_url":"https://arxiv.org/pdf/2412.18605v1.pdf","comment":"Project Page: https://orient-anything.github.io/"},{"id":"http://arxiv.org/abs/2412.18604v1","updated":"2024-12-24T18:58:28Z","published":"2024-12-24T18:58:28Z","title":"Explaining in Diffusion: Explaining a Classifier Through Hierarchical\n Semantics with Text-to-Image Diffusion Models","summary":" Classifiers are important components in many computer vision tasks, serving\nas the foundational backbone of a wide variety of models employed across\ndiverse applications. However, understanding the decision-making process of\nclassifiers remains a significant challenge. 
We propose DiffEx, a novel method\nthat leverages the capabilities of text-to-image diffusion models to explain\nclassifier decisions. Unlike traditional GAN-based explainability models, which\nare limited to simple, single-concept analyses and typically require training a\nnew model for each classifier, our approach can explain classifiers that focus\non single concepts (such as faces or animals) as well as those that handle\ncomplex scenes involving multiple concepts. DiffEx employs vision-language\nmodels to create a hierarchical list of semantics, allowing users to identify\nnot only the overarching semantic influences on classifiers (e.g., the 'beard'\nsemantic in a facial classifier) but also their sub-types, such as 'goatee' or\n'Balbo' beard. Our experiments demonstrate that DiffEx is able to cover a\nsignificantly broader spectrum of semantics compared to its GAN counterparts,\nproviding a hierarchical tool that delivers a more detailed and fine-grained\nunderstanding of classifier decisions.\n","authors":["Tahira Kazimi","Ritika Allada","Pinar Yanardag"],"pdf_url":"https://arxiv.org/pdf/2412.18604v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.18600v1","updated":"2024-12-24T18:55:38Z","published":"2024-12-24T18:55:38Z","title":"ZeroHSI: Zero-Shot 4D Human-Scene Interaction by Video Generation","summary":" Human-scene interaction (HSI) generation is crucial for applications in\nembodied AI, virtual reality, and robotics. While existing methods can\nsynthesize realistic human motions in 3D scenes and generate plausible\nhuman-object interactions, they heavily rely on datasets containing paired 3D\nscene and motion capture data, which are expensive and time-consuming to\ncollect across diverse environments and interactions. We present ZeroHSI, a\nnovel approach that enables zero-shot 4D human-scene interaction synthesis by\nintegrating video generation and neural human rendering. 
Our key insight is to\nleverage the rich motion priors learned by state-of-the-art video generation\nmodels, which have been trained on vast amounts of natural human movements and\ninteractions, and use differentiable rendering to reconstruct human-scene\ninteractions. ZeroHSI can synthesize realistic human motions in both static\nscenes and environments with dynamic objects, without requiring any\nground-truth motion data. We evaluate ZeroHSI on a curated dataset of different\ntypes of various indoor and outdoor scenes with different interaction prompts,\ndemonstrating its ability to generate diverse and contextually appropriate\nhuman-scene interactions.\n","authors":["Hongjie Li","Hong-Xing Yu","Jiaman Li","Jiajun Wu"],"pdf_url":"https://arxiv.org/pdf/2412.18600v1.pdf","comment":"Project website: https://awfuact.github.io/zerohsi/"},{"id":"http://arxiv.org/abs/2412.18597v1","updated":"2024-12-24T18:51:19Z","published":"2024-12-24T18:51:19Z","title":"DiTCtrl: Exploring Attention Control in Multi-Modal Diffusion\n Transformer for Tuning-Free Multi-Prompt Longer Video Generation","summary":" Sora-like video generation models have achieved remarkable progress with a\nMulti-Modal Diffusion Transformer MM-DiT architecture. However, the current\nvideo generation models predominantly focus on single-prompt, struggling to\ngenerate coherent scenes with multiple sequential prompts that better reflect\nreal-world dynamic scenarios. While some pioneering works have explored\nmulti-prompt video generation, they face significant challenges including\nstrict training data requirements, weak prompt following, and unnatural\ntransitions. To address these problems, we propose DiTCtrl, a training-free\nmulti-prompt video generation method under MM-DiT architectures for the first\ntime. Our key idea is to take the multi-prompt video generation task as\ntemporal video editing with smooth transitions. 
To achieve this goal, we first\nanalyze MM-DiT's attention mechanism, finding that the 3D full attention\nbehaves similarly to that of the cross/self-attention blocks in the UNet-like\ndiffusion models, enabling mask-guided precise semantic control across\ndifferent prompts with attention sharing for multi-prompt video generation.\nBased on our careful design, the video generated by DiTCtrl achieves smooth\ntransitions and consistent object motion given multiple sequential prompts\nwithout additional training. Besides, we also present MPVBench, a new benchmark\nspecially designed for multi-prompt video generation to evaluate the\nperformance of multi-prompt generation. Extensive experiments demonstrate that\nour method achieves state-of-the-art performance without additional training.\n","authors":["Minghong Cai","Xiaodong Cun","Xiaoyu Li","Wenze Liu","Zhaoyang Zhang","Yong Zhang","Ying Shan","Xiangyu Yue"],"pdf_url":"https://arxiv.org/pdf/2412.18597v1.pdf","comment":"19 pages, 19 figures, Project page:\n https://onevfall.github.io/project_page/ditctrl ; GitHub repository:\n https://github.com/TencentARC/DiTCtrl"},{"id":"http://arxiv.org/abs/2412.18596v1","updated":"2024-12-24T18:51:11Z","published":"2024-12-24T18:51:11Z","title":"LatentCRF: Continuous CRF for Efficient Latent Diffusion","summary":" Latent Diffusion Models (LDMs) produce high-quality, photo-realistic images,\nhowever, the latency incurred by multiple costly inference iterations can\nrestrict their applicability. We introduce LatentCRF, a continuous Conditional\nRandom Field (CRF) model, implemented as a neural network layer, that models\nthe spatial and semantic relationships among the latent vectors in the LDM. By\nreplacing some of the computationally-intensive LDM inference iterations with\nour lightweight LatentCRF, we achieve a superior balance between quality, speed\nand diversity. We increase inference efficiency by 33% with no loss in image\nquality or diversity compared to the full LDM. 
LatentCRF is an easy add-on,\nwhich does not require modifying the LDM.\n","authors":["Kanchana Ranasinghe","Sadeep Jayasumana","Andreas Veit","Ayan Chakrabarti","Daniel Glasner","Michael S Ryoo","Srikumar Ramalingam","Sanjiv Kumar"],"pdf_url":"https://arxiv.org/pdf/2412.18596v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.18591v1","updated":"2024-12-24T18:45:14Z","published":"2024-12-24T18:45:14Z","title":"ClassifyViStA:WCE Classification with Visual understanding through\n Segmentation and Attention","summary":" Gastrointestinal (GI) bleeding is a serious medical condition that presents\nsignificant diagnostic challenges, particularly in settings with limited access\nto healthcare resources. Wireless Capsule Endoscopy (WCE) has emerged as a\npowerful diagnostic tool for visualizing the GI tract, but it requires\ntime-consuming manual analysis by experienced gastroenterologists, which is\nprone to human error and inefficient given the increasing number of patients. To\naddress this challenge, we propose ClassifyViStA, an AI-based framework\ndesigned for the automated detection and classification of bleeding and\nnon-bleeding frames from WCE videos. The model consists of a standard\nclassification path, augmented by two specialized branches: an implicit\nattention branch and a segmentation branch. The attention branch focuses on the\nbleeding regions, while the segmentation branch generates accurate segmentation\nmasks, which are used for classification and interpretability. The model is\nbuilt upon an ensemble of ResNet18 and VGG16 architectures to enhance\nclassification performance. 
For the bleeding region detection, we implement a\nSoft Non-Maximum Suppression (Soft NMS) approach with YOLOv8, which improves\nthe handling of overlapping bounding boxes, resulting in more accurate and\nnuanced detections. The system's interpretability is enhanced by using the\nsegmentation masks to explain the classification results, offering insights\ninto the decision-making process similar to the way a gastroenterologist\nidentifies bleeding regions. Our approach not only automates the detection of\nGI bleeding but also provides an interpretable solution that can ease the\nburden on healthcare professionals and improve diagnostic efficiency. Our code\nis available at ClassifyViStA.\n","authors":["S. Balasubramanian","Ammu Abhishek","Yedu Krishna","Darshan Gera"],"pdf_url":"https://arxiv.org/pdf/2412.18591v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.18589v1","updated":"2024-12-24T18:43:09Z","published":"2024-12-24T18:43:09Z","title":"Text-Driven Tumor Synthesis","summary":" Tumor synthesis can generate examples that AI often misses or over-detects,\nimproving AI performance by training on these challenging cases. However,\nexisting synthesis methods, which are typically unconditional -- generating\nimages from random variables -- or conditioned only by tumor shapes, lack\ncontrollability over specific tumor characteristics such as texture,\nheterogeneity, boundaries, and pathology type. As a result, the generated\ntumors may be overly similar or duplicates of existing training data, failing\nto effectively address AI's weaknesses. We propose a new text-driven tumor\nsynthesis approach, termed TextoMorph, that provides textual control over tumor\ncharacteristics. 
This is particularly beneficial for examples that confuse the\nAI the most, such as early tumor detection (increasing Sensitivity by +8.5%),\ntumor segmentation for precise radiotherapy (increasing DSC by +6.3%), and\nclassification between benign and malignant tumors (improving Sensitivity by\n+8.2%). By incorporating text mined from radiology reports into the synthesis\nprocess, we increase the variability and controllability of the synthetic\ntumors to target AI's failure cases more precisely. Moreover, TextoMorph uses\ncontrastive learning across different texts and CT scans, significantly\nreducing dependence on scarce image-report pairs (only 141 pairs used in this\nstudy) by leveraging a large corpus of 34,035 radiology reports. Finally, we\nhave developed rigorous tests to evaluate synthetic tumors, including\nText-Driven Visual Turing Test and Radiomics Pattern Analysis, showing that our\nsynthetic tumors is realistic and diverse in texture, heterogeneity,\nboundaries, and pathology.\n","authors":["Xinran Li","Yi Shuai","Chen Liu","Qi Chen","Qilong Wu","Pengfei Guo","Dong Yang","Can Zhao","Pedro R. A. S. Bassi","Daguang Xu","Kang Wang","Yang Yang","Alan Yuille","Zongwei Zhou"],"pdf_url":"https://arxiv.org/pdf/2412.18589v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.17105v2","updated":"2024-12-24T18:27:20Z","published":"2024-12-22T17:34:01Z","title":"Refining CNN-based Heatmap Regression with Gradient-based Corner Points\n for Electrode Localization","summary":" We propose a method for detecting the electrode positions in lithium-ion\nbatteries. The process begins by identifying the region of interest (ROI) in\nthe battery's X-ray image through corner point detection. 
A convolutional\nneural network is then used to regress the pole positions within this ROI.\nFinally, the regressed positions are optimized and corrected using corner point\npriors, significantly mitigating the loss of localization accuracy caused by\noperations such as feature map down-sampling and padding during network\ntraining. Our findings show that combining traditional pixel gradient analysis\nwith CNN-based heatmap regression for keypoint extraction enhances both\naccuracy and efficiency, resulting in significant performance improvements.\n","authors":["Lin Wu"],"pdf_url":"https://arxiv.org/pdf/2412.17105v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.18584v1","updated":"2024-12-24T18:25:50Z","published":"2024-12-24T18:25:50Z","title":"Resolution-Robust 3D MRI Reconstruction with 2D Diffusion Priors:\n Diverse-Resolution Training Outperforms Interpolation","summary":" Deep learning-based 3D imaging, in particular magnetic resonance imaging\n(MRI), is challenging because of limited availability of 3D training data.\nTherefore, 2D diffusion models trained on 2D slices are starting to be\nleveraged for 3D MRI reconstruction. However, as we show in this paper,\nexisting methods pertain to a fixed voxel size, and performance degrades when\nthe voxel size is varied, as it is often the case in clinical practice. In this\npaper, we propose and study several approaches for resolution-robust 3D MRI\nreconstruction with 2D diffusion priors. As a result of this investigation, we\nobtain a simple resolution-robust variational 3D reconstruction approach based\non diffusion-guided regularization of randomly sampled 2D slices. This method\nprovides competitive reconstruction quality compared to posterior sampling\nbaselines. 
Towards resolving the sensitivity to resolution-shifts, we\ninvestigate state-of-the-art model-based approaches including Gaussian\nsplatting, neural representations, and infinite-dimensional diffusion models,\nas well as a simple data-centric approach of training the diffusion model on\nseveral resolutions. Our experiments demonstrate that the model-based\napproaches fail to close the performance gap in 3D MRI. In contrast, the\ndata-centric approach of training the diffusion model on various resolutions\neffectively provides a resolution-robust method without compromising accuracy.\n","authors":["Anselm Krainovic","Stefan Ruschke","Reinhard Heckel"],"pdf_url":"https://arxiv.org/pdf/2412.18584v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.18565v1","updated":"2024-12-24T17:36:34Z","published":"2024-12-24T17:36:34Z","title":"3DEnhancer: Consistent Multi-View Diffusion for 3D Enhancement","summary":" Despite advances in neural rendering, due to the scarcity of high-quality 3D\ndatasets and the inherent limitations of multi-view diffusion models, view\nsynthesis and 3D model generation are restricted to low resolutions with\nsuboptimal multi-view consistency. In this study, we present a novel 3D\nenhancement pipeline, dubbed 3DEnhancer, which employs a multi-view latent\ndiffusion model to enhance coarse 3D inputs while preserving multi-view\nconsistency. Our method includes a pose-aware encoder and a diffusion-based\ndenoiser to refine low-quality multi-view images, along with data augmentation\nand a multi-view attention module with epipolar aggregation to maintain\nconsistent, high-quality 3D outputs across views. Unlike existing video-based\napproaches, our model supports seamless multi-view enhancement with improved\ncoherence across diverse viewing angles. 
Extensive evaluations show that\n3DEnhancer significantly outperforms existing methods, boosting both multi-view\nenhancement and per-instance 3D optimization tasks.\n","authors":["Yihang Luo","Shangchen Zhou","Yushi Lan","Xingang Pan","Chen Change Loy"],"pdf_url":"https://arxiv.org/pdf/2412.18565v1.pdf","comment":"Project page: https://yihangluo.com/projects/3DEnhancer"},{"id":"http://arxiv.org/abs/2412.16662v2","updated":"2024-12-24T17:21:50Z","published":"2024-12-21T15:23:34Z","title":"Adversarial Attack Against Images Classification based on Generative\n Adversarial Networks","summary":" Adversarial attacks on image classification systems have always been an\nimportant problem in the field of machine learning, and generative adversarial\nnetworks (GANs), as popular models in the field of image generation, have been\nwidely used in various novel scenarios due to their powerful generative\ncapabilities. However, with the popularity of generative adversarial networks,\nthe misuse of fake image technology has raised a series of security problems,\nsuch as malicious tampering with other people's photos and videos, and invasion\nof personal privacy. Inspired by the generative adversarial networks, this work\nproposes a novel adversarial attack method, aiming to gain insight into the\nweaknesses of the image classification system and improve its anti-attack\nability. 
Specifically, the generative adversarial networks are used to generate\nadversarial samples with small perturbations but enough to affect the\ndecision-making of the classifier, and the adversarial samples are generated\nthrough the adversarial learning of the training generator and the classifier.\nFrom extensive experiment analysis, we evaluate the effectiveness of the method\non a classical image classification dataset, and the results show that our\nmodel successfully deceives a variety of advanced classifiers while maintaining\nthe naturalness of adversarial samples.\n","authors":["Yahe Yang"],"pdf_url":"https://arxiv.org/pdf/2412.16662v2.pdf","comment":"7 pages, 6 figures"},{"id":"http://arxiv.org/abs/2412.18545v1","updated":"2024-12-24T16:52:21Z","published":"2024-12-24T16:52:21Z","title":"Advancing Deformable Medical Image Registration with Multi-axis\n Cross-covariance Attention","summary":" Deformable image registration is a fundamental requirement for medical image\nanalysis. Recently, transformers have been widely used in deep learning-based\nregistration methods for their ability to capture long-range dependency via\nself-attention (SA). However, the high computation and memory loads of SA\n(growing quadratically with the spatial resolution) hinder transformers from\nprocessing subtle textural information in high-resolution image features, e.g.,\nat the full and half image resolutions. This limits deformable registration as\nthe high-resolution textural information is crucial for finding precise\npixel-wise correspondence between subtle anatomical structures.\nCross-covariance Attention (XCA), as a \"transposed\" version of SA that operates\nacross feature channels, has complexity growing linearly with the spatial\nresolution, providing the feasibility of capturing long-range dependency among\nhigh-resolution image features. 
However, existing XCA-based transformers merely\ncapture coarse global long-range dependency, which are unsuitable for\ndeformable image registration relying primarily on fine-grained local\ncorrespondence. In this study, we propose to improve existing deep\nlearning-based registration methods by embedding a new XCA mechanism. To this\nend, we design an XCA-based transformer block optimized for deformable medical\nimage registration, named Multi-Axis XCA (MAXCA). Our MAXCA serves as a general\nnetwork block that can be embedded into various registration network\narchitectures. It can capture both global and local long-range dependency among\nhigh-resolution image features by applying regional and dilated XCA in parallel\nvia a multi-axis design. Extensive experiments on two well-benchmarked\ninter-/intra-patient registration tasks with seven public medical datasets\ndemonstrate that our MAXCA block enables state-of-the-art registration\nperformance.\n","authors":["Mingyuan Meng","Michael Fulham","Lei Bi","Jinman Kim"],"pdf_url":"https://arxiv.org/pdf/2412.18545v1.pdf","comment":"Under Review"},{"id":"http://arxiv.org/abs/2412.18525v1","updated":"2024-12-24T16:08:25Z","published":"2024-12-24T16:08:25Z","title":"The Key of Understanding Vision Tasks: Explanatory Instructions","summary":" Computer Vision (CV) has yet to fully achieve the zero-shot task\ngeneralization observed in Natural Language Processing (NLP), despite following\nmany of the milestones established in NLP, such as large transformer models,\nextensive pre-training, and the auto-regression paradigm, among others. In this\npaper, we explore the idea that CV adopts discrete and terminological task\ndefinitions (\\eg, ``image segmentation''), which may be a key barrier to\nzero-shot task generalization. Our hypothesis is that without truly\nunderstanding previously-seen tasks--due to these terminological\ndefinitions--deep models struggle to generalize to novel tasks. 
To verify this,\nwe introduce Explanatory Instructions, which provide an intuitive way to define\nCV task objectives through detailed linguistic transformations from input\nimages to outputs. We create a large-scale dataset comprising 12 million\n``image input $\\to$ explanatory instruction $\\to$ output'' triplets, and train\nan auto-regressive-based vision-language model (AR-based VLM) that takes both\nimages and explanatory instructions as input. By learning to follow these\ninstructions, the AR-based VLM achieves instruction-level zero-shot\ncapabilities for previously-seen tasks and demonstrates strong zero-shot\ngeneralization for unseen CV tasks. Code and dataset will be openly available\non our GitHub repository.\n","authors":["Yang Shen","Xiu-Shen Wei","Yifan Sun","Yuxin Song","Tao Yuan","Jian Jin","Heyang Xu","Yazhou Yao","Errui Ding"],"pdf_url":"https://arxiv.org/pdf/2412.18525v1.pdf","comment":"40 pages"},{"id":"http://arxiv.org/abs/2412.18524v1","updated":"2024-12-24T16:08:24Z","published":"2024-12-24T16:08:24Z","title":"HTR-JAND: Handwritten Text Recognition with Joint Attention Network and\n Knowledge Distillation","summary":" Despite significant advances in deep learning, current Handwritten Text\nRecognition (HTR) systems struggle with the inherent complexity of historical\ndocuments, including diverse writing styles, degraded text quality, and\ncomputational efficiency requirements across multiple languages and time\nperiods. This paper introduces HTR-JAND (HTR-JAND: Handwritten Text Recognition\nwith Joint Attention Network and Knowledge Distillation), an efficient HTR\nframework that combines advanced feature extraction with knowledge\ndistillation. 
Our architecture incorporates three key components: (1) a CNN\narchitecture integrating FullGatedConv2d layers with Squeeze-and-Excitation\nblocks for adaptive feature extraction, (2) a Combined Attention mechanism\nfusing Multi-Head Self-Attention with Proxima Attention for robust sequence\nmodeling, and (3) a Knowledge Distillation framework enabling efficient model\ncompression while preserving accuracy through curriculum-based training. The\nHTR-JAND framework implements a multi-stage training approach combining\ncurriculum learning, synthetic data generation, and multi-task learning for\ncross-dataset knowledge transfer. We enhance recognition accuracy through\ncontext-aware T5 post-processing, particularly effective for historical\ndocuments. Comprehensive evaluations demonstrate HTR-JAND's effectiveness,\nachieving state-of-the-art Character Error Rates (CER) of 1.23\\%, 1.02\\%, and\n2.02\\% on IAM, RIMES, and Bentham datasets respectively. Our Student model\nachieves a 48\\% parameter reduction (0.75M versus 1.5M parameters) while\nmaintaining competitive performance through efficient knowledge transfer.\nSource code and pre-trained models are available at\n\\href{https://github.com/DocumentRecognitionModels/HTR-JAND}{Github}.\n","authors":["Mohammed Hamdan","Abderrahmane Rahiche","Mohamed Cheriet"],"pdf_url":"https://arxiv.org/pdf/2412.18524v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.18505v1","updated":"2024-12-24T15:43:04Z","published":"2024-12-24T15:43:04Z","title":"VORTEX: A Spatial Computing Framework for Optimized Drone Telemetry\n Extraction from First-Person View Flight Data","summary":" This paper presents the Visual Optical Recognition Telemetry EXtraction\n(VORTEX) system for extracting and analyzing drone telemetry data from First\nPerson View (FPV) Uncrewed Aerial System (UAS) footage. 
VORTEX employs MMOCR, a\nPyTorch-based Optical Character Recognition (OCR) toolbox, to extract telemetry\nvariables from drone Heads Up Display (HUD) recordings, utilizing advanced\nimage preprocessing techniques, including CLAHE enhancement and adaptive\nthresholding. The study optimizes spatial accuracy and computational efficiency\nthrough systematic investigation of temporal sampling rates (1s, 5s, 10s, 15s,\n20s) and coordinate processing methods. Results demonstrate that the 5-second\nsampling rate, utilizing 4.07% of available frames, provides the optimal\nbalance with a point retention rate of 64% and mean speed accuracy within 4.2%\nof the 1-second baseline while reducing computational overhead by 80.5%.\nComparative analysis of coordinate processing methods reveals that while UTM\nZone 33N projection and Haversine calculations provide consistently similar\nresults (within 0.1% difference), raw WGS84 coordinates underestimate distances\nby 15-30% and speeds by 20-35%. Altitude measurements showed unexpected\nresilience to sampling rate variations, with only 2.1% variation across all\nintervals. This research is the first of its kind, providing quantitative\nbenchmarks for establishing a robust framework for drone telemetry extraction\nand analysis using open-source tools and spatial libraries.\n","authors":["James E. Gallagher","Edward J. Oughton"],"pdf_url":"https://arxiv.org/pdf/2412.18505v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2409.10080v2","updated":"2024-12-24T15:22:46Z","published":"2024-09-16T08:37:09Z","title":"DAE-Fuse: An Adaptive Discriminative Autoencoder for Multi-Modality\n Image Fusion","summary":" In extreme scenarios such as nighttime or low-visibility environments,\nachieving reliable perception is critical for applications like autonomous\ndriving, robotics, and surveillance. 
Multi-modality image fusion, particularly\nintegrating infrared imaging, offers a robust solution by combining\ncomplementary information from different modalities to enhance scene\nunderstanding and decision-making. However, current methods face significant\nlimitations: GAN-based approaches often produce blurry images that lack\nfine-grained details, while AE-based methods may introduce bias toward specific\nmodalities, leading to unnatural fusion results. To address these challenges,\nwe propose DAE-Fuse, a novel two-phase discriminative autoencoder framework\nthat generates sharp and natural fused images. Furthermore, We pioneer the\nextension of image fusion techniques from static images to the video domain\nwhile preserving temporal consistency across frames, thus advancing the\nperceptual capabilities required for autonomous navigation. Extensive\nexperiments on public datasets demonstrate that DAE-Fuse achieves\nstate-of-the-art performance on multiple benchmarks, with superior\ngeneralizability to tasks like medical image fusion.\n","authors":["Yuchen Guo","Ruoxiang Xu","Rongcheng Li","Zhenghao Wu","Weifeng Su"],"pdf_url":"https://arxiv.org/pdf/2409.10080v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.18483v1","updated":"2024-12-24T15:14:58Z","published":"2024-12-24T15:14:58Z","title":"A region-wide, multi-year set of crop field boundary labels for Africa","summary":" African agriculture is undergoing rapid transformation. Annual maps of crop\nfields are key to understanding the nature of this transformation, but such\nmaps are currently lacking and must be developed using advanced machine\nlearning models trained on high resolution remote sensing imagery. To enable\nthe development of such models, we delineated field boundaries in 33,746 Planet\nimages captured between 2017 and 2023 across the continent using a custom\nlabeling platform with built-in procedures for assessing and mitigating label\nerror. 
We collected 42,403 labels, including 7,204 labels arising from tasks\ndedicated to assessing label quality (Class 1 labels), 32,167 from sites mapped\nonce by a single labeller (Class 2) and 3,032 labels from sites where 3 or more\nlabellers were tasked to map the same location (Class 4). Class 1 labels were\nused to calculate labeller-specific quality scores, while Class 1 and 4 sites\nmapped by at least 3 labellers were used to further evaluate label uncertainty\nusing a Bayesian risk metric. Quality metrics showed that label quality was\nmoderately high (0.75) for measures of total field extent, but low regarding\nthe number of individual fields delineated (0.33), and the position of field\nedges (0.05). These values are expected when delineating small-scale fields in\n3-5 m resolution imagery, which can be too coarse to reliably distinguish\nsmaller fields, particularly in dense croplands, and therefore requires\nsubstantial labeller judgement. Nevertheless, previous work shows that such\nlabels can train effective field mapping models. Furthermore, this large,\nprobabilistic sample on its own provides valuable insight into regional\nagricultural characteristics, highlighting variations in the median field size\nand density. The imagery and vectorized labels along with quality information\nis available for download from two public repositories.\n","authors":["L. D. Estes","A. Wussah","M. Asipunu","M. Gathigi","P. Kovačič","J. Muhando","B. V. Yeboah","F. K. Addai","E. S. Akakpo","M. K. Allotey","P. Amkoya","E. Amponsem","K. D. Donkoh","N. Ha","E. Heltzel","C. Juma","R. Mdawida","A. Miroyo","J. Mucha","J. Mugami","F. Mwawaza","D. A. Nyarko","P. Oduor","K. N. Ohemeng","S. I. D. Segbefia","T. Tumbula","F. Wambua","G. H. Xeflide","S. Ye","F. 
Yeboah"],"pdf_url":"https://arxiv.org/pdf/2412.18483v1.pdf","comment":"22 pages, 8 figures"},{"id":"http://arxiv.org/abs/2408.11475v2","updated":"2024-12-24T14:46:25Z","published":"2024-08-21T09:42:04Z","title":"TrackGo: A Flexible and Efficient Method for Controllable Video\n Generation","summary":" Recent years have seen substantial progress in diffusion-based controllable\nvideo generation. However, achieving precise control in complex scenarios,\nincluding fine-grained object parts, sophisticated motion trajectories, and\ncoherent background movement, remains a challenge. In this paper, we introduce\nTrackGo, a novel approach that leverages free-form masks and arrows for\nconditional video generation. This method offers users with a flexible and\nprecise mechanism for manipulating video content. We also propose the\nTrackAdapter for control implementation, an efficient and lightweight adapter\ndesigned to be seamlessly integrated into the temporal self-attention layers of\na pretrained video generation model. This design leverages our observation that\nthe attention map of these layers can accurately activate regions corresponding\nto motion in videos. 
Our experimental results demonstrate that our new\napproach, enhanced by the TrackAdapter, achieves state-of-the-art performance\non key metrics such as FVD, FID, and ObjMC scores.\n","authors":["Haitao Zhou","Chuang Wang","Rui Nie","Jinlin Liu","Dongdong Yu","Qian Yu","Changhu Wang"],"pdf_url":"https://arxiv.org/pdf/2408.11475v2.pdf","comment":"Accepted by AAAI 2025"},{"id":"http://arxiv.org/abs/2412.18459v1","updated":"2024-12-24T14:32:27Z","published":"2024-12-24T14:32:27Z","title":"Underwater Image Restoration via Polymorphic Large Kernel CNNs","summary":" Underwater Image Restoration (UIR) remains a challenging task in computer\nvision due to the complex degradation of images in underwater environments.\nWhile recent approaches have leveraged various deep learning techniques,\nincluding Transformers and complex, parameter-heavy models to achieve\nsignificant improvements in restoration effects, we demonstrate that pure CNN\narchitectures with lightweight parameters can achieve comparable results. In\nthis paper, we introduce UIR-PolyKernel, a novel method for underwater image\nrestoration that leverages Polymorphic Large Kernel CNNs. Our approach uniquely\ncombines large kernel convolutions of diverse sizes and shapes to effectively\ncapture long-range dependencies within underwater imagery. Additionally, we\nintroduce a Hybrid Domain Attention module that integrates frequency and\nspatial domain attention mechanisms to enhance feature importance. By\nleveraging the frequency domain, we can capture hidden features that may not be\nperceptible to humans but are crucial for identifying patterns in both\nunderwater and on-air images. This approach enhances the generalization and\nrobustness of our UIR model. Extensive experiments on benchmark datasets\ndemonstrate that UIR-PolyKernel achieves state-of-the-art performance in\nunderwater image restoration tasks, both quantitatively and qualitatively. 
Our\nresults show that well-designed pure CNN architectures can effectively compete\nwith more complex models, offering a balance between performance and\ncomputational efficiency. This work provides new insights into the potential of\nCNN-based approaches for challenging image restoration tasks in underwater\nenvironments. The code is available at\n\\href{https://github.com/CXH-Research/UIR-PolyKernel}{https://github.com/CXH-Research/UIR-PolyKernel}.\n","authors":["Xiaojiao Guo","Yihang Dong","Xuhang Chen","Weiwen Chen","Zimeng Li","FuChen Zheng","Chi-Man Pun"],"pdf_url":"https://arxiv.org/pdf/2412.18459v1.pdf","comment":"Accepted by ICASSP2025"},{"id":"http://arxiv.org/abs/2412.18450v1","updated":"2024-12-24T14:21:58Z","published":"2024-12-24T14:21:58Z","title":"3DGraphLLM: Combining Semantic Graphs and Large Language Models for 3D\n Scene Understanding","summary":" A 3D scene graph represents a compact scene model, storing information about\nthe objects and the semantic relationships between them, making its use\npromising for robotic tasks. When interacting with a user, an embodied\nintelligent agent should be capable of responding to various queries about the\nscene formulated in natural language. Large Language Models (LLMs) are\nbeneficial solutions for user-robot interaction due to their natural language\nunderstanding and reasoning abilities. Recent methods for creating learnable\nrepresentations of 3D scenes have demonstrated the potential to improve the\nquality of LLMs responses by adapting to the 3D world. However, the existing\nmethods do not explicitly utilize information about the semantic relationships\nbetween objects, limiting themselves to information about their coordinates. In\nthis work, we propose a method 3DGraphLLM for constructing a learnable\nrepresentation of a 3D scene graph. The learnable representation is used as\ninput for LLMs to perform 3D vision-language tasks. 
In our experiments on\npopular ScanRefer, RIORefer, Multi3DRefer, ScanQA, Sqa3D, and Scan2cap\ndatasets, we demonstrate the advantage of this approach over baseline methods\nthat do not use information about the semantic relationships between objects.\nThe code is publicly available at\nhttps://github.com/CognitiveAISystems/3DGraphLLM.\n","authors":["Tatiana Zemskova","Dmitry Yudin"],"pdf_url":"https://arxiv.org/pdf/2412.18450v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.17496v2","updated":"2024-12-24T14:18:19Z","published":"2024-12-23T11:53:06Z","title":"Guided Real Image Dehazing using YCbCr Color Space","summary":" Image dehazing, particularly with learning-based methods, has gained\nsignificant attention due to its importance in real-world applications.\nHowever, relying solely on the RGB color space often fall short, frequently\nleaving residual haze. This arises from two main issues: the difficulty in\nobtaining clear textural features from hazy RGB images and the complexity of\nacquiring real haze/clean image pairs outside controlled environments like\nsmoke-filled scenes. To address these issues, we first propose a novel\nStructure Guided Dehazing Network (SGDN) that leverages the superior structural\nproperties of YCbCr features over RGB. It comprises two key modules: Bi-Color\nGuidance Bridge (BGB) and Color Enhancement Module (CEM). BGB integrates a\nphase integration module and an interactive attention module, utilizing the\nrich texture features of the YCbCr space to guide the RGB space, thereby\nrecovering clearer features in both frequency and spatial domains. To maintain\ntonal consistency, CEM further enhances the color perception of RGB features by\naggregating YCbCr channel information. Furthermore, for effective supervised\nlearning, we introduce a Real-World Well-Aligned Haze (RW$^2$AH) dataset, which\nincludes a diverse range of scenes from various geographical regions and\nclimate conditions. 
Experimental results demonstrate that our method surpasses\nexisting state-of-the-art methods across multiple real-world smoke/haze\ndatasets. Code and Dataset:\n\\textcolor{blue}{\\url{https://github.com/fiwy0527/AAAI25_SGDN.}}\n","authors":["Wenxuan Fang","Junkai Fan","Yu Zheng","Jiangwei Weng","Ying Tai","Jun Li"],"pdf_url":"https://arxiv.org/pdf/2412.17496v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2410.07795v3","updated":"2024-12-24T14:10:58Z","published":"2024-10-10T10:24:59Z","title":"Optimal-state Dynamics Estimation for Physics-based Human Motion Capture\n from Videos","summary":" Human motion capture from monocular videos has made significant progress in\nrecent years. However, modern approaches often produce temporal artifacts, e.g.\nin form of jittery motion and struggle to achieve smooth and physically\nplausible motions. Explicitly integrating physics, in form of internal forces\nand exterior torques, helps alleviating these artifacts. Current\nstate-of-the-art approaches make use of an automatic PD controller to predict\ntorques and reaction forces in order to re-simulate the input kinematics, i.e.\nthe joint angles of a predefined skeleton. However, due to imperfect physical\nmodels, these methods often require simplifying assumptions and extensive\npreprocessing of the input kinematics to achieve good performance. To this end,\nwe propose a novel method to selectively incorporate the physics models with\nthe kinematics observations in an online setting, inspired by a neural\nKalman-filtering approach. We develop a control loop as a meta-PD controller to\npredict internal joint torques and external reaction forces, followed by a\nphysics-based motion simulation. A recurrent neural network is introduced to\nrealize a Kalman filter that attentively balances the kinematics input and\nsimulated motion, resulting in an optimal-state dynamics prediction. 
We show\nthat this filtering step is crucial to provide an online supervision that helps\nbalancing the shortcoming of the respective input motions, thus being important\nfor not only capturing accurate global motion trajectories but also producing\nphysically plausible human poses. The proposed approach excels in the\nphysics-based human pose estimation task and demonstrates the physical\nplausibility of the predictive dynamics, compared to state of the art. The code\nis available on https://github.com/cuongle1206/OSDCap\n","authors":["Cuong Le","Viktor Johansson","Manon Kok","Bastian Wandt"],"pdf_url":"https://arxiv.org/pdf/2410.07795v3.pdf","comment":"17 pages, 7 figure, NeurIPS 2024"},{"id":"http://arxiv.org/abs/2404.09292v2","updated":"2024-12-24T14:07:46Z","published":"2024-04-14T15:58:35Z","title":"Bridging Data Islands: Geographic Heterogeneity-Aware Federated Learning\n for Collaborative Remote Sensing Semantic Segmentation","summary":" Remote sensing semantic segmentation (RSS) is an essential technology in\nearth observation missions. Due to concerns over geographic information\nsecurity, data privacy, storage bottleneck and industry competition,\nhigh-quality annotated remote sensing images are often isolated and distributed\nacross institutions. The issue of remote sensing data islands poses challenges\nfor fully utilizing isolated datasets to train a global model. Federated\nlearning (FL), a privacy-preserving distributed collaborative learning\ntechnology, offers a potential solution to leverage isolated remote sensing\ndata. Typically, remote sensing images from different institutions exhibit\nsignificant geographic heterogeneity, characterized by coupled\nclass-distribution heterogeneity and object-appearance heterogeneity. However,\nexisting FL methods lack consideration of them, leading to a decline in the\nperformance of the global model when FL is directly applied to RSS. 
We propose\na novel Geographic heterogeneity-aware Federated learning (GeoFed) framework to\nbridge data islands in RSS. Our framework consists of three modules, including\nthe Global Insight Enhancement (GIE) module, the Essential Feature Mining (EFM)\nmodule and the Local-Global Balance (LoGo) module. Through the GIE module,\nclass distribution heterogeneity is alleviated by introducing a prior global\nclass distribution vector. We design an EFM module to alleviate object\nappearance heterogeneity by constructing essential features. Furthermore, the\nLoGo module enables the model to possess both global generalization capability\nand local adaptation. Extensive experiments on three public datasets (i.e.,\nFedFBP, FedCASID, FedInria) demonstrate that our GeoFed framework consistently\noutperforms the current state-of-the-art methods.\n","authors":["Jieyi Tan","Yansheng Li","Sergey A. Bartalev","Shinkarenko Stanislav","Bo Dang","Yongjun Zhang","Liangqi Yuan","Wei Chen"],"pdf_url":"https://arxiv.org/pdf/2404.09292v2.pdf","comment":"19 pages,12 figures, 10 tables"},{"id":"http://arxiv.org/abs/2412.15618v2","updated":"2024-12-24T14:07:12Z","published":"2024-12-20T07:22:41Z","title":"3D Shape Tokenization","summary":" We introduce Shape Tokens, a 3D representation that is continuous, compact,\nand easy to incorporate into machine learning models. Shape Tokens act as\nconditioning vectors that represent shape information in a 3D flow-matching\nmodel. The flow-matching model is trained to approximate probability density\nfunctions corresponding to delta functions concentrated on the surfaces of\nshapes in 3D. By attaching Shape Tokens to various machine learning models, we\ncan generate new shapes, convert images to 3D, align 3D shapes with text and\nimages, and render shapes directly at variable, user specified, resolution.\nMoreover, Shape Tokens enable a systematic analysis of geometric properties\nsuch as normal, density, and deformation field. 
Across all tasks and\nexperiments, utilizing Shape Tokens demonstrate strong performance compared to\nexisting baselines.\n","authors":["Jen-Hao Rick Chang","Yuyang Wang","Miguel Angel Bautista Martin","Jiatao Gu","Josh Susskind","Oncel Tuzel"],"pdf_url":"https://arxiv.org/pdf/2412.15618v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.01904v3","updated":"2024-12-24T13:58:21Z","published":"2024-11-04T09:15:21Z","title":"FPPL: An Efficient and Non-IID Robust Federated Continual Learning\n Framework","summary":" Federated continual learning (FCL) aims to learn from sequential data stream\nin the decentralized federated learning setting, while simultaneously\nmitigating the catastrophic forgetting issue in classical continual learning.\nExisting FCL methods usually employ typical rehearsal mechanisms, which could\nresult in privacy violations or additional onerous storage and computational\nburdens. In this work, an efficient and non-IID robust federated continual\nlearning framework, called Federated Prototype-Augmented Prompt Learning\n(FPPL), is proposed. The FPPL can collaboratively learn lightweight prompts\naugmented by prototypes without rehearsal. On the client side, a fusion\nfunction is employed to fully leverage the knowledge contained in task-specific\nprompts for alleviating catastrophic forgetting. Additionally, global\nprototypes aggregated from the server are used to obtain unified representation\nthrough contrastive learning, mitigating the impact of non-IID-derived data\nheterogeneity. On the server side, locally uploaded prototypes are utilized to\nperform debiasing on the classifier, further alleviating the performance\ndegradation caused by both non-IID and catastrophic forgetting. Empirical\nevaluations demonstrate the effectiveness of FPPL, achieving notable\nperformance with an efficient design while remaining robust to diverse non-IID\ndegrees. 
Code is available at: https://github.com/ycheoo/FPPL.\n","authors":["Yuchen He","Chuyun Shen","Xiangfeng Wang","Bo Jin"],"pdf_url":"https://arxiv.org/pdf/2411.01904v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.10734v2","updated":"2024-12-24T13:35:31Z","published":"2024-12-14T08:08:40Z","title":"OmniHD-Scenes: A Next-Generation Multimodal Dataset for Autonomous\n Driving","summary":" The rapid advancement of deep learning has intensified the need for\ncomprehensive data for use by autonomous driving algorithms. High-quality\ndatasets are crucial for the development of effective data-driven autonomous\ndriving solutions. Next-generation autonomous driving datasets must be\nmultimodal, incorporating data from advanced sensors that feature extensive\ndata coverage, detailed annotations, and diverse scene representation. To\naddress this need, we present OmniHD-Scenes, a large-scale multimodal dataset\nthat provides comprehensive omnidirectional high-definition data. The\nOmniHD-Scenes dataset combines data from 128-beam LiDAR, six cameras, and six\n4D imaging radar systems to achieve full environmental perception. The dataset\ncomprises 1501 clips, each approximately 30-s long, totaling more than 450K\nsynchronized frames and more than 5.85 million synchronized sensor data points.\nWe also propose a novel 4D annotation pipeline. To date, we have annotated 200\nclips with more than 514K precise 3D bounding boxes. These clips also include\nsemantic segmentation annotations for static scene elements. Additionally, we\nintroduce a novel automated pipeline for generation of the dense occupancy\nground truth, which effectively leverages information from non-key frames.\nAlongside the proposed dataset, we establish comprehensive evaluation metrics,\nbaseline models, and benchmarks for 3D detection and semantic occupancy\nprediction. 
These benchmarks utilize surround-view cameras and 4D imaging radar\nto explore cost-effective sensor solutions for autonomous driving applications.\nExtensive experiments demonstrate the effectiveness of our low-cost sensor\nconfiguration and its robustness under adverse conditions. Data will be\nreleased at https://www.2077ai.com/OmniHD-Scenes.\n","authors":["Lianqing Zheng","Long Yang","Qunshu Lin","Wenjin Ai","Minghao Liu","Shouyi Lu","Jianan Liu","Hongze Ren","Jingyue Mo","Xiaokai Bai","Jie Bai","Zhixiong Ma","Xichan Zhu"],"pdf_url":"https://arxiv.org/pdf/2412.10734v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.18421v1","updated":"2024-12-24T13:27:25Z","published":"2024-12-24T13:27:25Z","title":"Fashionability-Enhancing Outfit Image Editing with Conditional Diffusion\n Models","summary":" Image generation in the fashion domain has predominantly focused on\npreserving body characteristics or following input prompts, but little\nattention has been paid to improving the inherent fashionability of the output\nimages. This paper presents a novel diffusion model-based approach that\ngenerates fashion images with improved fashionability while maintaining control\nover key attributes. Key components of our method include: 1) fashionability\nenhancement, which ensures that the generated images are more fashionable than\nthe input; 2) preservation of body characteristics, encouraging the generated\nimages to maintain the original shape and proportions of the input; and 3)\nautomatic fashion optimization, which does not rely on manual input or external\nprompts. We also employ two methods to collect training data for guidance while\ngenerating and evaluating the images. In particular, we rate outfit images\nusing fashionability scores annotated by multiple fashion experts through\nOpenSkill-based and five critical aspect-based pairwise comparisons. 
These\nmethods provide complementary perspectives for assessing and improving the\nfashionability of the generated images. The experimental results show that our\napproach outperforms the baseline Fashion++ in generating images with superior\nfashionability, demonstrating its effectiveness in producing more stylish and\nappealing fashion images.\n","authors":["Qice Qin","Yuki Hirakawa","Ryotaro Shimizu","Takuya Furusawa","Edgar Simo-Serra"],"pdf_url":"https://arxiv.org/pdf/2412.18421v1.pdf","comment":"11 pages, 6 figures"},{"id":"http://arxiv.org/abs/2412.18417v1","updated":"2024-12-24T13:18:00Z","published":"2024-12-24T13:18:00Z","title":"Ultra-Low Complexity On-Orbit Compression for Remote Sensing Imagery via\n Block Modulated Imaging","summary":" The growing field of remote sensing faces a challenge: the ever-increasing\nsize and volume of imagery data are exceeding the storage and transmission\ncapabilities of satellite platforms. Efficient compression of remote sensing\nimagery is a critical solution to alleviate these burdens on satellites.\nHowever, existing compression methods are often too computationally expensive\nfor satellites. With the continued advancement of compressed sensing theory,\nsingle-pixel imaging emerges as a powerful tool that brings new possibilities\nfor on-orbit image compression. However, it still suffers from prolonged\nimaging times and the inability to perform high-resolution imaging, hindering\nits practical application. This paper advances the study of compressed sensing\nin remote sensing image compression, proposing Block Modulated Imaging (BMI).\nBy requiring only a single exposure, BMI significantly enhances imaging\nacquisition speeds. Additionally, BMI obviates the need for digital micromirror\ndevices and surpasses limitations in image resolution. Furthermore, we propose\na novel decoding network specifically designed to reconstruct images compressed\nunder the BMI framework. 
Leveraging the gated 3D convolutions and promoting\nefficient information flow across stages through a Two-Way Cross-Attention\nmodule, our decoding network exhibits demonstrably superior reconstruction\nperformance. Extensive experiments conducted on multiple renowned remote\nsensing datasets unequivocally demonstrate the efficacy of our proposed method.\nTo further validate its practical applicability, we developed and tested a\nprototype of the BMI-based camera, which has shown promising potential for\non-orbit image compression. The code is available at\nhttps://github.com/Johnathan218/BMNet.\n","authors":["Zhibin Wang","Yanxin Cai","Jiayi Zhou","Yangming Zhang","Tianyu Li","Wei Li","Xun Liu","Guoqing Wang","Yang Yang"],"pdf_url":"https://arxiv.org/pdf/2412.18417v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.10489v2","updated":"2024-12-24T13:03:44Z","published":"2024-12-13T16:27:54Z","title":"CognitionCapturer: Decoding Visual Stimuli From Human EEG Signal With\n Multimodal Information","summary":" Electroencephalogram (EEG) signals have attracted significant attention from\nresearchers due to their non-invasive nature and high temporal sensitivity in\ndecoding visual stimuli. However, most recent studies have focused solely on\nthe relationship between EEG and image data pairs, neglecting the valuable\n``beyond-image-modality\" information embedded in EEG signals. This results in\nthe loss of critical multimodal information in EEG. To address this limitation,\nwe propose CognitionCapturer, a unified framework that fully leverages\nmultimodal data to represent EEG signals. Specifically, CognitionCapturer\ntrains Modality Expert Encoders for each modality to extract cross-modal\ninformation from the EEG modality. 
Then, it introduces a diffusion prior to map\nthe EEG embedding space to the CLIP embedding space, followed by using a\npretrained generative model, the proposed framework can reconstruct visual\nstimuli with high semantic and structural fidelity. Notably, the framework does\nnot require any fine-tuning of the generative models and can be extended to\nincorporate more modalities. Through extensive experiments, we demonstrate that\nCognitionCapturer outperforms state-of-the-art methods both qualitatively and\nquantitatively. Code: https://github.com/XiaoZhangYES/CognitionCapturer.\n","authors":["Kaifan Zhang","Lihuo He","Xin Jiang","Wen Lu","Di Wang","Xinbo Gao"],"pdf_url":"https://arxiv.org/pdf/2412.10489v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.04707v3","updated":"2024-12-24T12:58:48Z","published":"2024-11-07T13:45:23Z","title":"From CNN to CNN + RNN: Adapting Visualization Techniques for Time-Series\n Anomaly Detection","summary":" Deep neural networks are highly effective in solving complex problems but are\noften viewed as \"black boxes,\" limiting their adoption in contexts where\ntransparency and explainability are essential. This lack of visibility raises\nethical and legal concerns, particularly in critical areas like security, where\nautomated decisions can have significant consequences. The General Data\nProtection Regulation (GDPR) underscores the importance of justifying these\ndecisions. In this work, we explore visualization techniques to improve the\nunderstanding of anomaly detection models based on convolutional recurrent\nneural networks (CNN + RNN) with a TimeDistributed layer. Our model combines\nVGG19 for convolutional feature extraction and a GRU layer for sequential\nanalysis of real-time video data. While suitable for temporal data, this\nstructure complicates gradient propagation, as sequence elements are processed\nindependently, dissociating temporal information. 
We adapt visualization\ntechniques such as saliency maps and Grad-CAM to address these challenges. This\narticle highlights the difficulties in visually interpreting video-based models\nand demonstrates how techniques for static images can be adapted to recurrent\narchitectures, offering a transitional solution in the absence of dedicated\nmethods.\n","authors":["Fabien Poirier"],"pdf_url":"https://arxiv.org/pdf/2411.04707v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.18409v1","updated":"2024-12-24T12:55:31Z","published":"2024-12-24T12:55:31Z","title":"Re-assessing ImageNet: How aligned is its single-label assumption with\n its multi-label nature?","summary":" ImageNet, an influential dataset in computer vision, is traditionally\nevaluated using single-label classification, which assumes that an image can be\nadequately described by a single concept or label. However, this approach may\nnot fully capture the complex semantics within the images available in\nImageNet, potentially hindering the development of models that effectively\nlearn these intricacies. This study critically examines the prevalent\nsingle-label benchmarking approach and advocates for a shift to multi-label\nbenchmarking for ImageNet. This shift would enable a more comprehensive\nassessment of the capabilities of deep neural network (DNN) models. We analyze\nthe effectiveness of pre-trained state-of-the-art DNNs on ImageNet and one of\nits variants, ImageNetV2. Studies in the literature have reported unexpected\naccuracy drops of 11% to 14% on ImageNetV2. Our findings show that these\nreported declines are largely attributable to a characteristic of the dataset\nthat has not received sufficient attention -- the proportion of images with\nmultiple labels. Taking this characteristic into account, the results of our\nexperiments provide evidence that there is no substantial degradation in\neffectiveness on ImageNetV2. 
Furthermore, we acknowledge that ImageNet\npre-trained models exhibit some capability at capturing the multi-label nature\nof the dataset even though they were trained under the single-label assumption.\nConsequently, we propose a new evaluation approach to augment existing\napproaches that assess this capability. Our findings highlight the importance\nof considering the multi-label nature of the ImageNet dataset during\nbenchmarking. Failing to do so could lead to incorrect conclusions regarding\nthe effectiveness of DNNs and divert research efforts from addressing other\nsubstantial challenges related to the reliability and robustness of these\nmodels.\n","authors":["Esla Timothy Anzaku","Seyed Amir Mousavi","Arnout Van Messem","Wesley De Neve"],"pdf_url":"https://arxiv.org/pdf/2412.18409v1.pdf","comment":"20 pages, 8 figures"},{"id":"http://arxiv.org/abs/2412.18406v1","updated":"2024-12-24T12:52:16Z","published":"2024-12-24T12:52:16Z","title":"How accurate is mechanobiology?","summary":" Mechanobiology is gaining more and more traction as the fundamental role of\nphysical forces in biological function becomes clearer. Forces at the\nmicroscale are often measured indirectly using inverse problems such as\nTraction Force Microscopy because biological experiments are hard to access\nwith physical probes. In contrast with the experimental nature of biology and\nphysics, these measurements do not come with error bars, confidence regions, or\np-values. 
The aim of this manuscript is to publicize this issue and to propose\na first step towards a remedy in the form of a general reconstruction framework\nthat enables hypothesis testing.\n","authors":["Aleix Boquet-Pujadas"],"pdf_url":"https://arxiv.org/pdf/2412.18406v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.18404v1","updated":"2024-12-24T12:51:05Z","published":"2024-12-24T12:51:05Z","title":"Extract Free Dense Misalignment from CLIP","summary":" Recent vision-language foundation models still frequently produce outputs\nmisaligned with their inputs, evidenced by object hallucination in captioning\nand prompt misalignment in the text-to-image generation model. Recent studies\nhave explored methods for identifying misaligned elements, aiming not only to\nenhance interpretability but also to improve model performance. However,\ncurrent approaches primarily rely on large foundation models in a zero-shot\nmanner or fine-tuned models with human annotations, which limits scalability\ndue to significant computational costs. This work proposes a novel approach,\ndubbed CLIP4DM, for detecting dense misalignments from pre-trained CLIP,\nspecifically focusing on pinpointing misaligned words between image and text.\nWe carefully revamp the gradient-based attribution computation method, enabling\nnegative gradient of individual text tokens to indicate misalignment. We also\npropose F-CLIPScore, which aggregates misaligned attributions with a global\nalignment score. We evaluate our method on various dense misalignment detection\nbenchmarks, covering various image and text domains and misalignment types. Our\nmethod demonstrates state-of-the-art performance among zero-shot models and\ncompetitive performance with fine-tuned models while maintaining superior\nefficiency. Our qualitative examples show that our method has a unique strength\nto detect entity-level objects, intangible objects, and attributes that can not\nbe easily detected for existing works. 
We conduct ablation studies and analyses\nto highlight the strengths and limitations of our approach. Our code is\npublicly available at https://github.com/naver-ai/CLIP4DM.\n","authors":["JeongYeon Nam","Jinbae Im","Wonjae Kim","Taeho Kil"],"pdf_url":"https://arxiv.org/pdf/2412.18404v1.pdf","comment":"16 pages, 14 figures, AAAI 2025"},{"id":"http://arxiv.org/abs/2412.18390v1","updated":"2024-12-24T12:28:19Z","published":"2024-12-24T12:28:19Z","title":"RDPM: Solve Diffusion Probabilistic Models via Recurrent Token\n Prediction","summary":" Diffusion Probabilistic Models (DPMs) have emerged as the de facto approach\nfor high-fidelity image synthesis, operating diffusion processes on continuous\nVAE latent, which significantly differ from the text generation methods\nemployed by Large Language Models (LLMs). In this paper, we introduce a novel\ngenerative framework, the Recurrent Diffusion Probabilistic Model (RDPM), which\nenhances the diffusion process through a recurrent token prediction mechanism,\nthereby pioneering the field of Discrete Diffusion. By progressively\nintroducing Gaussian noise into the latent representations of images and\nencoding them into vector-quantized tokens in a recurrent manner, RDPM\nfacilitates a unique diffusion process on discrete-value domains. This process\niteratively predicts the token codes for subsequent timesteps, transforming the\ninitial standard Gaussian noise into the source data distribution, aligning\nwith GPT-style models in terms of the loss function. RDPM demonstrates superior\nperformance while benefiting from the speed advantage of requiring only a few\ninference steps. This model not only leverages the diffusion process to ensure\nhigh-quality generation but also converts continuous signals into a series of\nhigh-fidelity discrete tokens, thereby maintaining a unified optimization\nstrategy with other discrete tokens, such as text. 
We anticipate that this work\nwill contribute to the development of a unified model for multimodal\ngeneration, specifically by integrating continuous signal domains such as\nimages, videos, and audio with text. We will release the code and model weights\nto the open-source community.\n","authors":["Wu Xiaoping","Hu Jie","Wei Xiaoming"],"pdf_url":"https://arxiv.org/pdf/2412.18390v1.pdf","comment":"8 pages"},{"id":"http://arxiv.org/abs/2412.18386v1","updated":"2024-12-24T12:16:43Z","published":"2024-12-24T12:16:43Z","title":"Switch-a-View: Few-Shot View Selection Learned from Edited Videos","summary":" We introduce Switch-a-View, a model that learns to automatically select the\nviewpoint to display at each timepoint when creating a how-to video. The key\ninsight of our approach is how to train such a model from unlabeled--but\nhuman-edited--video samples. We pose a pretext task that pseudo-labels segments\nin the training videos for their primary viewpoint (egocentric or exocentric),\nand then discovers the patterns between those view-switch moments on the one\nhand and the visual and spoken content in the how-to video on the other hand.\nArmed with this predictor, our model then takes an unseen multi-view video as\ninput and orchestrates which viewpoint should be displayed when. We further\nintroduce a few-shot training setting that permits steering the model towards a\nnew data domain. 
We demonstrate our idea on a variety of real-world video from\nHowTo100M and Ego-Exo4D and rigorously validate its advantages.\n","authors":["Sagnik Majumder","Tushar Nagarajan","Ziad Al-Halah","Kristen Grauman"],"pdf_url":"https://arxiv.org/pdf/2412.18386v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.18380v1","updated":"2024-12-24T12:08:50Z","published":"2024-12-24T12:08:50Z","title":"RSGaussian:3D Gaussian Splatting with LiDAR for Aerial Remote Sensing\n Novel View Synthesis","summary":" This study presents RSGaussian, an innovative novel view synthesis (NVS)\nmethod for aerial remote sensing scenes that incorporate LiDAR point cloud as\nconstraints into the 3D Gaussian Splatting method, which ensures that Gaussians\ngrow and split along geometric benchmarks, addressing the overgrowth and\nfloaters issues occurs. Additionally, the approach introduces coordinate\ntransformations with distortion parameters for camera models to achieve\npixel-level alignment between LiDAR point clouds and 2D images, facilitating\nheterogeneous data fusion and achieving the high-precision geo-alignment\nrequired in aerial remote sensing. Depth and plane consistency losses are\nincorporated into the loss function to guide Gaussians towards real depth and\nplane representations, significantly improving depth estimation accuracy.\nExperimental results indicate that our approach has achieved novel view\nsynthesis that balances photo-realistic visual quality and high-precision\ngeometric estimation under aerial remote sensing datasets. 
Finally, we have\nalso established and open-sourced a dense LiDAR point cloud dataset along with\nits corresponding aerial multi-view images, AIR-LONGYAN.\n","authors":["Yiling Yao","Wenjuan Zhang","Bing Zhang","Bocheng Li","Yaning Wang","Bowen Wang"],"pdf_url":"https://arxiv.org/pdf/2412.18380v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.04814v2","updated":"2024-12-24T11:57:46Z","published":"2024-12-06T07:16:14Z","title":"LiFT: Leveraging Human Feedback for Text-to-Video Model Alignment","summary":" Recent advancements in text-to-video (T2V) generative models have shown\nimpressive capabilities. However, these models are still inadequate in aligning\nsynthesized videos with human preferences (e.g., accurately reflecting text\ndescriptions), which is particularly difficult to address, as human preferences\nare inherently subjective and challenging to formalize as objective functions.\nTherefore, this paper proposes LiFT, a novel fine-tuning method leveraging\nhuman feedback for T2V model alignment. Specifically, we first construct a\nHuman Rating Annotation dataset, LiFT-HRA, consisting of approximately 10k\nhuman annotations, each including a score and its corresponding rationale.\nBased on this, we train a reward model LiFT-Critic to learn reward function\neffectively, which serves as a proxy for human judgment, measuring the\nalignment between given videos and human expectations. Lastly, we leverage the\nlearned reward function to align the T2V model by maximizing the\nreward-weighted likelihood. 
As a case study, we apply our pipeline to\nCogVideoX-2B, showing that the fine-tuned model outperforms the CogVideoX-5B\nacross all 16 metrics, highlighting the potential of human feedback in\nimproving the alignment and quality of synthesized videos.\n","authors":["Yibin Wang","Zhiyu Tan","Junyan Wang","Xiaomeng Yang","Cheng Jin","Hao Li"],"pdf_url":"https://arxiv.org/pdf/2412.04814v2.pdf","comment":"Project page: https://codegoat24.github.io/LiFT"},{"id":"http://arxiv.org/abs/2412.12716v3","updated":"2024-12-24T11:42:13Z","published":"2024-12-17T09:30:31Z","title":"Unsupervised UAV 3D Trajectories Estimation with Sparse Point Clouds","summary":" Compact UAV systems, while advancing delivery and surveillance, pose\nsignificant security challenges due to their small size, which hinders\ndetection by traditional methods. This paper presents a cost-effective,\nunsupervised UAV detection method using spatial-temporal sequence processing to\nfuse multiple LiDAR scans for accurate UAV tracking in real-world scenarios.\nOur approach segments point clouds into foreground and background, analyzes\nspatial-temporal data, and employs a scoring mechanism to enhance detection\naccuracy. Tested on a public dataset, our solution placed 4th in the CVPR 2024\nUG2+ Challenge, demonstrating its practical effectiveness. 
We plan to\nopen-source all designs, code, and sample data for the research community\ngithub.com/lianghanfang/UnLiDAR-UAV-Est.\n","authors":["Hanfang Liang","Yizhuo Yang","Jinming Hu","Jianfei Yang","Fen Liu","Shenghai Yuan"],"pdf_url":"https://arxiv.org/pdf/2412.12716v3.pdf","comment":"Paper Accepted for ICASSP 2025"},{"id":"http://arxiv.org/abs/2412.18355v1","updated":"2024-12-24T11:35:40Z","published":"2024-12-24T11:35:40Z","title":"Addressing Spatial-Temporal Data Heterogeneity in Federated Continual\n Learning via Tail Anchor","summary":" Federated continual learning (FCL) allows each client to continually update\nits knowledge from task streams, enhancing the applicability of federated\nlearning in real-world scenarios. However, FCL needs to address not only\nspatial data heterogeneity between clients but also temporal data heterogeneity\nbetween tasks. In this paper, empirical experiments demonstrate that such\ninput-level heterogeneity significantly affects the model's internal parameters\nand outputs, leading to severe spatial-temporal catastrophic forgetting of\nlocal and previous knowledge. To this end, we propose Federated Tail Anchor\n(FedTA) to mix trainable Tail Anchor with the frozen output features to adjust\ntheir position in the feature space, thereby overcoming parameter-forgetting\nand output-forgetting. Moreover, three novel components are also included in\nFedTA: Input Enhancement for improving the performance of pre-trained models on\ndownstream tasks; Selective Input Knowledge Fusion for fusion of heterogeneous\nlocal knowledge on the server side; and Best Global Prototype Selection for\nfinding the best anchor point for each class in the feature space. 
Extensive\nexperiments demonstrate that FedTA not only outperforms existing FCL methods\nbut also effectively preserves the relative positions of features, remaining\nunaffected by spatial and temporal changes.\n","authors":["Hao Yu","Xin Yang","Le Zhang","Hanlin Gu","Tianrui Li","Lixin Fan","Qiang Yang"],"pdf_url":"https://arxiv.org/pdf/2412.18355v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.17808v2","updated":"2024-12-24T11:02:29Z","published":"2024-12-23T18:59:06Z","title":"Dora: Sampling and Benchmarking for 3D Shape Variational Auto-Encoders","summary":" Recent 3D content generation pipelines commonly employ Variational\nAutoencoders (VAEs) to encode shapes into compact latent representations for\ndiffusion-based generation. However, the widely adopted uniform point sampling\nstrategy in Shape VAE training often leads to a significant loss of geometric\ndetails, limiting the quality of shape reconstruction and downstream generation\ntasks. We present Dora-VAE, a novel approach that enhances VAE reconstruction\nthrough our proposed sharp edge sampling strategy and a dual cross-attention\nmechanism. By identifying and prioritizing regions with high geometric\ncomplexity during training, our method significantly improves the preservation\nof fine-grained shape features. Such sampling strategy and the dual attention\nmechanism enable the VAE to focus on crucial geometric details that are\ntypically missed by uniform sampling approaches. To systematically evaluate VAE\nreconstruction quality, we additionally propose Dora-bench, a benchmark that\nquantifies shape complexity through the density of sharp edges, introducing a\nnew metric focused on reconstruction accuracy at these salient geometric\nfeatures. Extensive experiments on the Dora-bench demonstrate that Dora-VAE\nachieves comparable reconstruction quality to the state-of-the-art dense\nXCube-VAE while requiring a latent space at least 8$\\times$ smaller (1,280 vs.\n> 10,000 codes). 
We will release our code and benchmark dataset to facilitate\nfuture research in 3D shape modeling.\n","authors":["Rui Chen","Jianfeng Zhang","Yixun Liang","Guan Luo","Weiyu Li","Jiarui Liu","Xiu Li","Xiaoxiao Long","Jiashi Feng","Ping Tan"],"pdf_url":"https://arxiv.org/pdf/2412.17808v2.pdf","comment":"Project page: https://aruichen.github.io/Dora/"},{"id":"http://arxiv.org/abs/2412.18342v1","updated":"2024-12-24T11:00:23Z","published":"2024-12-24T11:00:23Z","title":"Mitigating Label Noise using Prompt-Based Hyperbolic Meta-Learning in\n Open-Set Domain Generalization","summary":" Open-Set Domain Generalization (OSDG) is a challenging task requiring models\nto accurately predict familiar categories while minimizing confidence for\nunknown categories to effectively reject them in unseen domains. While the OSDG\nfield has seen considerable advancements, the impact of label noise--a common\nissue in real-world datasets--has been largely overlooked. Label noise can\nmislead model optimization, thereby exacerbating the challenges of open-set\nrecognition in novel domains. In this study, we take the first step towards\naddressing Open-Set Domain Generalization under Noisy Labels (OSDG-NL) by\nconstructing dedicated benchmarks derived from widely used OSDG datasets,\nincluding PACS and DigitsDG. We evaluate baseline approaches by integrating\ntechniques from both label denoising and OSDG methodologies, highlighting the\nlimitations of existing strategies in handling label noise effectively. To\naddress these limitations, we propose HyProMeta, a novel framework that\nintegrates hyperbolic category prototypes for label noise-aware meta-learning\nalongside a learnable new-category agnostic prompt designed to enhance\ngeneralization to unseen classes. Our extensive experiments demonstrate the\nsuperior performance of HyProMeta compared to state-of-the-art methods across\nthe newly established benchmarks. 
The source code of this work is released at\nhttps://github.com/KPeng9510/HyProMeta.\n","authors":["Kunyu Peng","Di Wen","Sarfraz M. Saquib","Yufan Chen","Junwei Zheng","David Schneider","Kailun Yang","Jiamin Wu","Alina Roitberg","Rainer Stiefelhagen"],"pdf_url":"https://arxiv.org/pdf/2412.18342v1.pdf","comment":"The source code of this work is released at\n https://github.com/KPeng9510/HyProMeta"},{"id":"http://arxiv.org/abs/2406.16540v3","updated":"2024-12-24T10:42:39Z","published":"2024-06-24T11:20:44Z","title":"Improving robustness to corruptions with multiplicative weight\n perturbations","summary":" Deep neural networks (DNNs) excel on clean images but struggle with corrupted\nones. Incorporating specific corruptions into the data augmentation pipeline\ncan improve robustness to those corruptions but may harm performance on clean\nimages and other types of distortion. In this paper, we introduce an\nalternative approach that improves the robustness of DNNs to a wide range of\ncorruptions without compromising accuracy on clean images. We first demonstrate\nthat input perturbations can be mimicked by multiplicative perturbations in the\nweight space. Leveraging this, we propose Data Augmentation via Multiplicative\nPerturbation (DAMP), a training method that optimizes DNNs under random\nmultiplicative weight perturbations. We also examine the recently proposed\nAdaptive Sharpness-Aware Minimization (ASAM) and show that it optimizes DNNs\nunder adversarial multiplicative weight perturbations. Experiments on image\nclassification datasets (CIFAR-10/100, TinyImageNet and ImageNet) and neural\nnetwork architectures (ResNet50, ViT-S/16, ViT-B/16) show that DAMP enhances\nmodel generalization performance in the presence of corruptions across\ndifferent settings. 
Notably, DAMP is able to train a ViT-S/16 on ImageNet from\nscratch, reaching the top-1 error of 23.7% which is comparable to ResNet50\nwithout extensive data augmentations.\n","authors":["Trung Trinh","Markus Heinonen","Luigi Acerbi","Samuel Kaski"],"pdf_url":"https://arxiv.org/pdf/2406.16540v3.pdf","comment":"Published at NeurIPS 2024 (spotlight). Code is available at\n https://github.com/trungtrinh44/DAMP"},{"id":"http://arxiv.org/abs/2412.18335v1","updated":"2024-12-24T10:42:25Z","published":"2024-12-24T10:42:25Z","title":"FloNa: Floor Plan Guided Embodied Visual Navigation","summary":" Humans naturally rely on floor plans to navigate in unfamiliar environments,\nas they are readily available, reliable, and provide rich geometrical guidance.\nHowever, existing visual navigation settings overlook this valuable prior\nknowledge, leading to limited efficiency and accuracy. To eliminate this gap,\nwe introduce a novel navigation task: Floor Plan Visual Navigation (FloNa), the\nfirst attempt to incorporate floor plan into embodied visual navigation. While\nthe floor plan offers significant advantages, two key challenges emerge: (1)\nhandling the spatial inconsistency between the floor plan and the actual scene\nlayout for collision-free navigation, and (2) aligning observed images with the\nfloor plan sketch despite their distinct modalities. To address these\nchallenges, we propose FloDiff, a novel diffusion policy framework\nincorporating a localization module to facilitate alignment between the current\nobservation and the floor plan. We further collect $20k$ navigation episodes\nacross $117$ scenes in the iGibson simulator to support the training and\nevaluation. Extensive experiments demonstrate the effectiveness and efficiency\nof our framework in unfamiliar scenes using floor plan knowledge. 
Project\nwebsite: https://gauleejx.github.io/flona/.\n","authors":["Jiaxin Li","Weiqi Huang","Zan Wang","Wei Liang","Huijun Di","Feng Liu"],"pdf_url":"https://arxiv.org/pdf/2412.18335v1.pdf","comment":"Accepted by AAAI 2025"},{"id":"http://arxiv.org/abs/2412.18327v1","updated":"2024-12-24T10:25:41Z","published":"2024-12-24T10:25:41Z","title":"HAUR: Human Annotation Understanding and Recognition Through Text-Heavy\n Images","summary":" Vision Question Answering (VQA) tasks use images to convey critical\ninformation to answer text-based questions, which is one of the most common\nforms of question answering in real-world scenarios. Numerous vision-text\nmodels exist today and have performed well on certain VQA tasks. However, these\nmodels exhibit significant limitations in understanding human annotations on\ntext-heavy images. To address this, we propose the Human Annotation\nUnderstanding and Recognition (HAUR) task. As part of this effort, we introduce\nthe Human Annotation Understanding and Recognition-5 (HAUR-5) dataset, which\nencompasses five common types of human annotations. Additionally, we developed\nand trained our model, OCR-Mix. Through comprehensive cross-model comparisons,\nour results demonstrate that OCR-Mix outperforms other models in this task. Our\ndataset and model will be released soon .\n","authors":["Yuchen Yang","Haoran Yan","Yanhao Chen","Qingqiang Wu","Qingqi Hong"],"pdf_url":"https://arxiv.org/pdf/2412.18327v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.18321v1","updated":"2024-12-24T10:13:20Z","published":"2024-12-24T10:13:20Z","title":"Computer Vision-Driven Gesture Recognition: Toward Natural and Intuitive\n Human-Computer","summary":" This study mainly explores the application of natural gesture recognition\nbased on computer vision in human-computer interaction, aiming to improve the\nfluency and naturalness of human-computer interaction through gesture\nrecognition technology. 
In the fields of virtual reality, augmented reality and\nsmart home, traditional input methods have gradually failed to meet the needs\nof users for interactive experience. As an intuitive and convenient interaction\nmethod, gestures have received more and more attention. This paper proposes a\ngesture recognition method based on a three-dimensional hand skeleton model. By\nsimulating the three-dimensional spatial distribution of hand joints, a\nsimplified hand skeleton structure is constructed. By connecting the palm and\neach finger joint, a dynamic and static gesture model of the hand is formed,\nwhich further improves the accuracy and efficiency of gesture recognition.\nExperimental results show that this method can effectively recognize various\ngestures and maintain high recognition accuracy and real-time response\ncapabilities in different environments. In addition, combined with multimodal\ntechnologies such as eye tracking, the intelligence level of the gesture\nrecognition system can be further improved, bringing a richer and more\nintuitive user experience. In the future, with the continuous development of\ncomputer vision, deep learning and multimodal interaction technology, natural\ninteraction based on gestures will play an important role in a wider range of\napplication scenarios and promote revolutionary progress in human-computer\ninteraction.\n","authors":["Fenghua Shao","Tong Zhang","Shang Gao","Qi Sun","Liuqingqing Yang"],"pdf_url":"https://arxiv.org/pdf/2412.18321v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.18319v1","updated":"2024-12-24T10:07:51Z","published":"2024-12-24T10:07:51Z","title":"Mulberry: Empowering MLLM with o1-like Reasoning and Reflection via\n Collective Monte Carlo Tree Search","summary":" In this work, we aim to develop an MLLM that understands and solves questions\nby learning to create each intermediate step of the reasoning involved till the\nfinal answer. 
To this end, we propose Collective Monte Carlo Tree Search\n(CoMCTS), a new learning-to-reason method for MLLMs, which introduces the\nconcept of collective learning into ``tree search'' for effective and efficient\nreasoning-path searching and learning. The core idea of CoMCTS is to leverage\ncollective knowledge from multiple models to collaboratively conjecture, search\nand identify effective reasoning paths toward correct answers via four\niterative operations including Expansion, Simulation and Error Positioning,\nBackpropagation, and Selection. Using CoMCTS, we construct Mulberry-260k, a\nmultimodal dataset with a tree of rich, explicit and well-defined reasoning\nnodes for each question. With Mulberry-260k, we perform collective SFT to train\nour model, Mulberry, a series of MLLMs with o1-like step-by-step Reasoning and\nReflection capabilities. Extensive experiments demonstrate the superiority of\nour proposed methods on various benchmarks. Code will be available at\nhttps://github.com/HJYao00/Mulberry\n","authors":["Huanjin Yao","Jiaxing Huang","Wenhao Wu","Jingyi Zhang","Yibo Wang","Shunyu Liu","Yingjie Wang","Yuxin Song","Haocheng Feng","Li Shen","Dacheng Tao"],"pdf_url":"https://arxiv.org/pdf/2412.18319v1.pdf","comment":"Technical report"},{"id":"http://arxiv.org/abs/2301.04470v3","updated":"2024-12-24T09:44:03Z","published":"2023-01-10T08:15:35Z","title":"InstaGraM: Instance-level Graph Modeling for Vectorized HD Map Learning","summary":" For scalable autonomous driving, a robust map-based localization system,\nindependent of GPS, is fundamental. To achieve such map-based localization,\nonline high-definition (HD) map construction plays a significant role in\naccurate estimation of the pose. Although recent advancements in online HD map\nconstruction have predominantly investigated on vectorized representation due\nto its effectiveness, they suffer from computational cost and fixed parametric\nmodel, which limit scalability. 
To alleviate these limitations, we propose a\nnovel HD map learning framework that leverages graph modeling. This framework\nis designed to learn the construction of diverse geometric shapes, thereby\nenhancing the scalability of HD map construction. Our approach involves\nrepresenting the map elements as an instance-level graph by decomposing them\ninto vertices and edges to facilitate accurate and efficient end-to-end\nvectorized HD map learning. Furthermore, we introduce an association strategy\nusing a Graph Neural Network to efficiently handle the complex geometry of\nvarious map elements, while maintaining scalability. Comprehensive experiments\non public open dataset show that our proposed network outperforms\nstate-of-the-art model by $1.6$ mAP. We further showcase the superior\nscalability of our approach compared to state-of-the-art methods, achieving a\n$4.8$ mAP improvement in long range configuration. Our code is available at\nhttps://github.com/juyebshin/InstaGraM.\n","authors":["Juyeb Shin","Hyeonjun Jeong","Francois Rameau","Dongsuk Kum"],"pdf_url":"https://arxiv.org/pdf/2301.04470v3.pdf","comment":"Code available at https://github.com/juyebshin/InstaGraM"},{"id":"http://arxiv.org/abs/2406.19353v2","updated":"2024-12-24T09:33:24Z","published":"2024-06-27T17:32:18Z","title":"CORE4D: A 4D Human-Object-Human Interaction Dataset for Collaborative\n Object REarrangement","summary":" Understanding how humans cooperatively rearrange household objects is\ncritical for VR/AR and human-robot interaction. However, in-depth studies on\nmodeling these behaviors are under-researched due to the lack of relevant\ndatasets. We fill this gap by presenting CORE4D, a novel large-scale 4D\nhuman-object-human interaction dataset focusing on collaborative object\nrearrangement, which encompasses diverse compositions of various object\ngeometries, collaboration modes, and 3D scenes. 
With 1K human-object-human\nmotion sequences captured in the real world, we enrich CORE4D by contributing\nan iterative collaboration retargeting strategy to augment motions to a variety\nof novel objects. Leveraging this approach, CORE4D comprises a total of 11K\ncollaboration sequences spanning 3K real and virtual object shapes. Benefiting\nfrom extensive motion patterns provided by CORE4D, we benchmark two tasks\naiming at generating human-object interaction: human-object motion forecasting\nand interaction synthesis. Extensive experiments demonstrate the effectiveness\nof our collaboration retargeting strategy and indicate that CORE4D has posed\nnew challenges to existing human-object interaction generation methodologies.\n","authors":["Yun Liu","Chengwen Zhang","Ruofan Xing","Bingda Tang","Bowen Yang","Li Yi"],"pdf_url":"https://arxiv.org/pdf/2406.19353v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.17038v2","updated":"2024-12-24T09:15:14Z","published":"2024-12-22T14:30:26Z","title":"ErasableMask: A Robust and Erasable Privacy Protection Scheme against\n Black-box Face Recognition Models","summary":" While face recognition (FR) models have brought remarkable convenience in\nface verification and identification, they also pose substantial privacy risks\nto the public. Existing facial privacy protection schemes usually adopt\nadversarial examples to disrupt face verification of FR models. However, these\nschemes often suffer from weak transferability against black-box FR models and\npermanently damage the identifiable information that cannot fulfill the\nrequirements of authorized operations such as forensics and authentication. To\naddress these limitations, we propose ErasableMask, a robust and erasable\nprivacy protection scheme against black-box FR models. 
Specifically, via\nrethinking the inherent relationship between surrogate FR models, ErasableMask\nintroduces a novel meta-auxiliary attack, which boosts black-box\ntransferability by learning more general features in a stable and balancing\noptimization strategy. It also offers a perturbation erasion mechanism that\nsupports the erasion of semantic perturbations in protected face without\ndegrading image quality. To further improve performance, ErasableMask employs a\ncurriculum learning strategy to mitigate optimization conflicts between\nadversarial attack and perturbation erasion. Extensive experiments on the\nCelebA-HQ and FFHQ datasets demonstrate that ErasableMask achieves the\nstate-of-the-art performance in transferability, achieving over 72% confidence\non average in commercial FR systems. Moreover, ErasableMask also exhibits\noutstanding perturbation erasion performance, achieving over 90% erasion\nsuccess rate.\n","authors":["Sipeng Shen","Yunming Zhang","Dengpan Ye","Xiuwen Shi","Long Tang","Haoran Duan","Jiacheng Deng","Ziyi Liu"],"pdf_url":"https://arxiv.org/pdf/2412.17038v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.18303v1","updated":"2024-12-24T09:15:00Z","published":"2024-12-24T09:15:00Z","title":"Efficient and Context-Aware Label Propagation for Zero-/Few-Shot\n Training-Free Adaptation of Vision-Language Model","summary":" Vision-language models (VLMs) have revolutionized machine learning by\nleveraging large pre-trained models to tackle various downstream tasks. Despite\nimprovements in label, training, and data efficiency, many state-of-the-art\nVLMs still require task-specific hyperparameter tuning and fail to fully\nexploit test samples. To overcome these challenges, we propose a graph-based\napproach for label-efficient adaptation and inference. Our method dynamically\nconstructs a graph over text prompts, few-shot examples, and test samples,\nusing label propagation for inference without task-specific tuning. 
Unlike\nexisting zero-shot label propagation techniques, our approach requires no\nadditional unlabeled support set and effectively leverages the test sample\nmanifold through dynamic graph expansion. We further introduce a context-aware\nfeature re-weighting mechanism to improve task adaptation accuracy.\nAdditionally, our method supports efficient graph expansion, enabling real-time\ninductive inference. Extensive evaluations on downstream tasks, such as\nfine-grained categorization and out-of-distribution generalization, demonstrate\nthe effectiveness of our approach.\n","authors":["Yushu Li","Yongyi Su","Adam Goodge","Kui Jia","Xun Xu"],"pdf_url":"https://arxiv.org/pdf/2412.18303v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.18302v1","updated":"2024-12-24T09:11:37Z","published":"2024-12-24T09:11:37Z","title":"FameBias: Embedding Manipulation Bias Attack in Text-to-Image Models","summary":" Text-to-Image (T2I) diffusion models have rapidly advanced, enabling the\ngeneration of high-quality images that align closely with textual descriptions.\nHowever, this progress has also raised concerns about their misuse for\npropaganda and other malicious activities. Recent studies reveal that attackers\ncan embed biases into these models through simple fine-tuning, causing them to\ngenerate targeted imagery when triggered by specific phrases. This underscores\nthe potential for T2I models to act as tools for disseminating propaganda,\nproducing images aligned with an attacker's objective for end-users.\n Building on this concept, we introduce FameBias, a T2I biasing attack that\nmanipulates the embeddings of input prompts to generate images featuring\nspecific public figures. Unlike prior methods, Famebias operates solely on the\ninput embedding vectors without requiring additional model training. We\nevaluate FameBias comprehensively using Stable Diffusion V2, generating a large\ncorpus of images based on various trigger nouns and target public figures. 
Our\nexperiments demonstrate that FameBias achieves a high attack success rate while\npreserving the semantic context of the original prompts across multiple\ntrigger-target pairs.\n","authors":["Jaechul Roh","Andrew Yuan","Jinsong Mao"],"pdf_url":"https://arxiv.org/pdf/2412.18302v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.16956v2","updated":"2024-12-24T09:07:26Z","published":"2024-12-22T10:28:52Z","title":"Semantic Hierarchical Prompt Tuning for Parameter-Efficient Fine-Tuning","summary":" As the scale of vision models continues to grow, Visual Prompt Tuning (VPT)\nhas emerged as a parameter-efficient transfer learning technique, noted for its\nsuperior performance compared to full fine-tuning. However, indiscriminately\napplying prompts to every layer without considering their inherent\ncorrelations, can cause significant disturbances, leading to suboptimal\ntransferability. Additionally, VPT disrupts the original self-attention\nstructure, affecting the aggregation of visual features, and lacks a mechanism\nfor explicitly mining discriminative visual features, which are crucial for\nclassification. To address these issues, we propose a Semantic Hierarchical\nPrompt (SHIP) fine-tuning strategy. We adaptively construct semantic\nhierarchies and use semantic-independent and semantic-shared prompts to learn\nhierarchical representations. We also integrate attribute prompts and a prompt\nmatching loss to enhance feature discrimination and employ decoupled attention\nfor robustness and reduced inference costs. SHIP significantly improves\nperformance, achieving a 4.9% gain in accuracy over VPT with a ViT-B/16\nbackbone on VTAB-1k tasks. 
Our code is available at\nhttps://github.com/haoweiz23/SHIP.\n","authors":["Haowei Zhu","Fangyuan Zhang","Rui Qin","Tianxiang Pan","Junhai Yong","Bin Wang"],"pdf_url":"https://arxiv.org/pdf/2412.16956v2.pdf","comment":"Accepted by ICASSP 2025"},{"id":"http://arxiv.org/abs/2303.07189v4","updated":"2024-12-24T09:06:15Z","published":"2023-03-13T15:30:28Z","title":"Optimizing Convolutional Neural Networks for Chronic Obstructive\n Pulmonary Disease Detection in Clinical Computed Tomography Imaging","summary":" We aim to optimize the binary detection of Chronic Obstructive Pulmonary\nDisease (COPD) based on emphysema presence in the lung with convolutional\nneural networks (CNN) by exploring manually adjusted versus automated\nwindow-setting optimization (WSO) on computed tomography (CT) images. 7,194 CT\nimages (3,597 with COPD; 3,597 healthy controls) from 78 subjects were selected\nretrospectively (10.2018-12.2021) and preprocessed. For each image, intensity\nvalues were manually clipped to the emphysema window setting and a baseline\n'full-range' window setting. Class-balanced train, validation, and test sets\ncontained 3,392, 1,114, and 2,688 images. The network backbone was optimized by\ncomparing various CNN architectures. Furthermore, automated WSO was implemented\nby adding a customized layer to the model. The image-level area under the\nReceiver Operating Characteristics curve (AUC) [lower, upper limit 95%\nconfidence] was utilized to compare model variations. Repeated inference (n=7)\non the test set showed that the DenseNet was the most efficient backbone and\nachieved a mean AUC of 0.80 [0.76, 0.85] without WSO. Comparably, with input\nimages manually adjusted to the emphysema window, the DenseNet model predicted\nCOPD with a mean AUC of 0.86 [0.82, 0.89]. By adding a customized WSO layer to\nthe DenseNet, an optimal window in the proximity of the emphysema window\nsetting was learned automatically, and a mean AUC of 0.82 [0.78, 0.86] was\nachieved. 
Detection of COPD with DenseNet models was improved by WSO of CT data\nto the emphysema window setting range.\n","authors":["Tina Dorosti","Manuel Schultheiss","Felix Hofmann","Johannes Thalhammer","Luisa Kirchner","Theresa Urban","Franz Pfeiffer","Florian Schaff","Tobias Lasser","Daniela Pfeiffer"],"pdf_url":"https://arxiv.org/pdf/2303.07189v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.18298v1","updated":"2024-12-24T09:05:37Z","published":"2024-12-24T09:05:37Z","title":"Quo Vadis, Anomaly Detection? LLMs and VLMs in the Spotlight","summary":" Video anomaly detection (VAD) has witnessed significant advancements through\nthe integration of large language models (LLMs) and vision-language models\n(VLMs), addressing critical challenges such as interpretability, temporal\nreasoning, and generalization in dynamic, open-world scenarios. This paper\npresents an in-depth review of cutting-edge LLM-/VLM-based methods in 2024,\nfocusing on four key aspects: (i) enhancing interpretability through semantic\ninsights and textual explanations, making visual anomalies more understandable;\n(ii) capturing intricate temporal relationships to detect and localize dynamic\nanomalies across video frames; (iii) enabling few-shot and zero-shot detection\nto minimize reliance on large, annotated datasets; and (iv) addressing\nopen-world and class-agnostic anomalies by using semantic understanding and\nmotion features for spatiotemporal coherence. We highlight their potential to\nredefine the landscape of VAD. 
Additionally, we explore the synergy between\nvisual and textual modalities offered by LLMs and VLMs, highlighting their\ncombined strengths and proposing future directions to fully exploit the\npotential in enhancing video anomaly detection.\n","authors":["Xi Ding","Lei Wang"],"pdf_url":"https://arxiv.org/pdf/2412.18298v1.pdf","comment":"Research report"},{"id":"http://arxiv.org/abs/2412.18288v1","updated":"2024-12-24T08:52:06Z","published":"2024-12-24T08:52:06Z","title":"Towards understanding how attention mechanism works in deep learning","summary":" Attention mechanism has been extensively integrated within mainstream neural\nnetwork architectures, such as Transformers and graph attention networks. Yet,\nits underlying working principles remain somewhat elusive. What is its essence?\nAre there any connections between it and traditional machine learning\nalgorithms? In this study, we inspect the process of computing similarity using\nclassic metrics and vector space properties in manifold learning, clustering,\nand supervised learning. We identify the key characteristics of similarity\ncomputation and information propagation in these methods and demonstrate that\nthe self-attention mechanism in deep learning adheres to the same principles\nbut operates more flexibly and adaptively. We decompose the self-attention\nmechanism into a learnable pseudo-metric function and an information\npropagation process based on similarity computation. We prove that the\nself-attention mechanism converges to a drift-diffusion process through\ncontinuous modeling provided the pseudo-metric is a transformation of a metric\nand certain reasonable assumptions hold. This equation could be transformed\ninto a heat equation under a new metric. In addition, we give a first-order\nanalysis of attention mechanism with a general pseudo-metric function. This\nstudy aids in understanding the effects and principle of attention mechanism\nthrough physical intuition. 
Finally, we propose a modified attention mechanism\ncalled metric-attention by leveraging the concept of metric learning to\nfacilitate the ability to learn desired metrics more effectively. Experimental\nresults demonstrate that it outperforms self-attention regarding training\nefficiency, accuracy, and robustness.\n","authors":["Tianyu Ruan","Shihua Zhang"],"pdf_url":"https://arxiv.org/pdf/2412.18288v1.pdf","comment":"38 pages, 6 figures"},{"id":"http://arxiv.org/abs/2405.18911v3","updated":"2024-12-24T08:47:35Z","published":"2024-05-29T09:13:30Z","title":"Exploring Human-in-the-Loop Test-Time Adaptation by Synergizing Active\n Learning and Model Selection","summary":" Existing test-time adaptation (TTA) approaches often adapt models with the\nunlabeled testing data stream. A recent attempt relaxed the assumption by\nintroducing limited human annotation, referred to as Human-In-the-Loop\nTest-Time Adaptation (HILTTA) in this study. The focus of existing HILTTA\nstudies lies in selecting the most informative samples to label, a.k.a. active\nlearning. In this work, we are motivated by a pitfall of TTA, i.e. sensitivity\nto hyper-parameters, and propose to approach HILTTA by synergizing active\nlearning and model selection. Specifically, we first select samples for human\nannotation (active learning) and then use the labeled data to select optimal\nhyper-parameters (model selection). To prevent the model selection process from\noverfitting to local distributions, multiple regularization techniques are\nemployed to complement the validation objective. A sample selection strategy is\nfurther tailored by considering the balance between active learning and model\nselection purposes. We demonstrate on 5 TTA datasets that the proposed HILTTA\napproach is compatible with off-the-shelf TTA methods and such combinations\nsubstantially outperform the state-of-the-art HILTTA methods. 
Importantly, our\nproposed method can always prevent choosing the worst hyper-parameters on all\noff-the-shelf TTA methods. The source code is available at\nhttps://github.com/Yushu-Li/HILTTA.\n","authors":["Yushu Li","Yongyi Su","Xulei Yang","Kui Jia","Xun Xu"],"pdf_url":"https://arxiv.org/pdf/2405.18911v3.pdf","comment":"Accepted at Transactions on Machine Learning Research (TMLR)"},{"id":"http://arxiv.org/abs/2412.15670v2","updated":"2024-12-24T08:44:04Z","published":"2024-12-20T08:36:17Z","title":"BS-LDM: Effective Bone Suppression in High-Resolution Chest X-Ray Images\n with Conditional Latent Diffusion Models","summary":" The interference of overlapping bones and pulmonary structures can reduce the\neffectiveness of Chest X-ray (CXR) examinations. Bone suppression techniques\nhave been developed to improve diagnostic accuracy. Dual-energy subtraction\n(DES) imaging, a common method for bone suppression, is costly and exposes\npatients to higher radiation levels. Deep learning-based image generation\nmethods have been proposed as alternatives, however, they often fail to produce\nhigh-quality and high-resolution images, resulting in the loss of critical\nlesion information and texture details. To address these issues, in this paper,\nwe introduce an end-to-end framework for bone suppression in high-resolution\nCXR images, termed BS-LDM. This framework employs a conditional latent\ndiffusion model to generate high-resolution soft tissue images with fine detail\nand critical lung pathology by performing bone suppression in the latent space.\nWe implement offset noise during the noise addition phase of the training\nprocess to better render low-frequency information in soft tissue images.\nAdditionally, we introduce a dynamic clipping strategy during the sampling\nprocess to refine pixel intensity in the generated soft tissue images. 
We\ncompiled a substantial and high-quality bone suppression dataset, SZCH-X-Rays,\nincluding high-resolution paired CXR and DES soft tissue images from 818\npatients, collected from our partner hospitals. Moreover, we pre-processed 241\npairs of CXR and DES soft tissue images from the JSRT dataset, the largest\npublicly available dataset. Comprehensive experimental and clinical evaluations\ndemonstrate that BS-LDM exhibits superior bone suppression capabilities,\nhighlighting its significant clinical potential.\n","authors":["Yifei Sun","Zhanghao Chen","Hao Zheng","Ruiquan Ge","Jin Liu","Wenwen Min","Ahmed Elazab","Xiang Wan","Changmiao Wang"],"pdf_url":"https://arxiv.org/pdf/2412.15670v2.pdf","comment":"9 pages, 6 figures"},{"id":"http://arxiv.org/abs/2412.18282v1","updated":"2024-12-24T08:42:16Z","published":"2024-12-24T08:42:16Z","title":"Improved Feature Generating Framework for Transductive Zero-shot\n Learning","summary":" Feature Generative Adversarial Networks have emerged as powerful generative\nmodels in producing high-quality representations of unseen classes within the\nscope of Zero-shot Learning (ZSL). This paper delves into the pivotal influence\nof unseen class priors within the framework of transductive ZSL (TZSL) and\nilluminates the finding that even a marginal prior bias can result in\nsubstantial accuracy declines. Our extensive analysis uncovers that this\ninefficacy fundamentally stems from the utilization of an unconditional unseen\ndiscriminator - a core component in existing TZSL. We further establish that\nthe detrimental effects of this component are inevitable unless the generator\nperfectly fits class-specific distributions. Building on these insights, we\nintroduce our Improved Feature Generation Framework, termed I-VAEGAN, which\nincorporates two novel components: Pseudo-conditional Feature Adversarial (PFA)\nlearning and Variational Embedding Regression (VER). 
PFA circumvents the need\nfor prior estimation by explicitly injecting the predicted semantics as pseudo\nconditions for unseen classes premised by precise semantic regression.\nMeanwhile, VER utilizes reconstructive pre-training to learn class statistics,\nobtaining better semantic regression. Our I-VAEGAN achieves state-of-the-art\nTZSL accuracy across various benchmarks and priors. Our code would be released\nupon acceptance.\n","authors":["Zihan Ye","Xinyuan Ru","Shiming Chen","Yaochu Jin","Kaizhu Huang","Xiaobo Jin"],"pdf_url":"https://arxiv.org/pdf/2412.18282v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2207.08960v4","updated":"2024-12-24T08:40:58Z","published":"2022-07-18T22:10:57Z","title":"Enhancing Space-time Video Super-resolution via Spatial-temporal Feature\n Interaction","summary":" The target of space-time video super-resolution (STVSR) is to increase both\nthe frame rate (also referred to as the temporal resolution) and the spatial\nresolution of a given video. Recent approaches solve STVSR using end-to-end\ndeep neural networks. A popular solution is to first increase the frame rate of\nthe video; then perform feature refinement among different frame features; and\nlast increase the spatial resolutions of these features. The temporal\ncorrelation among features of different frames is carefully exploited in this\nprocess. The spatial correlation among features of different (spatial)\nresolutions, despite being also very important, is however not emphasized. In\nthis paper, we propose a spatial-temporal feature interaction network to\nenhance STVSR by exploiting both spatial and temporal correlations among\nfeatures of different frames and spatial resolutions. Specifically, the\nspatial-temporal frame interpolation module is introduced to interpolate low-\nand high-resolution intermediate frame features simultaneously and\ninteractively. 
The spatial-temporal local and global refinement modules are\nrespectively deployed afterwards to exploit the spatial-temporal correlation\namong different features for their refinement. Finally, a novel motion\nconsistency loss is employed to enhance the motion continuity among\nreconstructed frames. We conduct experiments on three standard benchmarks,\nVid4, Vimeo-90K and Adobe240, and the results demonstrate that our method\nimproves the state of the art methods by a considerable margin. Our codes will\nbe available at\nhttps://github.com/yuezijie/STINet-Space-time-Video-Super-resolution.\n","authors":["Zijie Yue","Miaojing Shi"],"pdf_url":"https://arxiv.org/pdf/2207.08960v4.pdf","comment":"Neural Networks"},{"id":"http://arxiv.org/abs/2412.18277v1","updated":"2024-12-24T08:38:35Z","published":"2024-12-24T08:38:35Z","title":"Towards Modality Generalization: A Benchmark and Prospective Analysis","summary":" Multi-modal learning has achieved remarkable success by integrating\ninformation from various modalities, achieving superior performance in tasks\nlike recognition and retrieval compared to uni-modal approaches. However,\nreal-world scenarios often present novel modalities that are unseen during\ntraining due to resource and privacy constraints, a challenge current methods\nstruggle to address. This paper introduces Modality Generalization (MG), which\nfocuses on enabling models to generalize to unseen modalities. We define two\ncases: weak MG, where both seen and unseen modalities can be mapped into a\njoint embedding space via existing perceptors, and strong MG, where no such\nmappings exist. To facilitate progress, we propose a comprehensive benchmark\nfeaturing multi-modal algorithms and adapt existing methods that focus on\ngeneralization. Extensive experiments highlight the complexity of MG, exposing\nthe limitations of existing methods and identifying key directions for future\nresearch. 
Our work provides a foundation for advancing robust and adaptable\nmulti-modal models, enabling them to handle unseen modalities in realistic\nscenarios.\n","authors":["Xiaohao Liu","Xiaobo Xia","Zhuo Huang","Tat-Seng Chua"],"pdf_url":"https://arxiv.org/pdf/2412.18277v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.18276v1","updated":"2024-12-24T08:38:34Z","published":"2024-12-24T08:38:34Z","title":"UNet--: Memory-Efficient and Feature-Enhanced Network Architecture based\n on U-Net with Reduced Skip-Connections","summary":" U-Net models with encoder, decoder, and skip-connections components have\ndemonstrated effectiveness in a variety of vision tasks. The skip-connections\ntransmit fine-grained information from the encoder to the decoder. It is\nnecessary to maintain the feature maps used by the skip-connections in memory\nbefore the decoding stage. Therefore, they are not friendly to devices with\nlimited resource. In this paper, we propose a universal method and architecture\nto reduce the memory consumption and meanwhile generate enhanced feature maps\nto improve network performance. To this end, we design a simple but effective\nMulti-Scale Information Aggregation Module (MSIAM) in the encoder and an\nInformation Enhancement Module (IEM) in the decoder. The MSIAM aggregates\nmulti-scale feature maps into single-scale with less memory. After that, the\naggregated feature maps can be expanded and enhanced to multi-scale feature\nmaps by the IEM. By applying the proposed method on NAFNet, a SOTA model in the\nfield of image restoration, we design a memory-efficient and feature-enhanced\nnetwork architecture, UNet--. The memory demand by the skip-connections in the\nUNet-- is reduced by 93.3%, while the performance is improved compared to\nNAFNet. 
Furthermore, we show that our proposed method can be generalized to\nmultiple visual tasks, with consistent improvements in both memory consumption\nand network accuracy compared to the existing efficient architectures.\n","authors":["Lingxiao Yin","Wei Tao","Dongyue Zhao","Tadayuki Ito","Kinya Osa","Masami Kato","Tse-Wei Chen"],"pdf_url":"https://arxiv.org/pdf/2412.18276v1.pdf","comment":"17 pages, 7 figures, accepted by ACCV2024"},{"id":"http://arxiv.org/abs/2412.18273v1","updated":"2024-12-24T08:32:38Z","published":"2024-12-24T08:32:38Z","title":"Sampling Bag of Views for Open-Vocabulary Object Detection","summary":" Existing open-vocabulary object detection (OVD) develops methods for testing\nunseen categories by aligning object region embeddings with corresponding VLM\nfeatures. A recent study leverages the idea that VLMs implicitly learn\ncompositional structures of semantic concepts within the image. Instead of\nusing an individual region embedding, it utilizes a bag of region embeddings as\na new representation to incorporate compositional structures into the OVD task.\nHowever, this approach often fails to capture the contextual concepts of each\nregion, leading to noisy compositional structures. This results in only\nmarginal performance improvements and reduced efficiency. To address this, we\npropose a novel concept-based alignment method that samples a more powerful and\nefficient compositional structure. Our approach groups contextually related\n``concepts'' into a bag and adjusts the scale of concepts within the bag for\nmore effective embedding alignment. Combined with Faster R-CNN, our method\nachieves improvements of 2.6 box AP50 and 0.5 mask AP over prior work on novel\ncategories in the open-vocabulary COCO and LVIS benchmarks. Furthermore, our\nmethod reduces CLIP computation in FLOPs by 80.3% compared to previous\nresearch, significantly enhancing efficiency. 
Experimental results demonstrate\nthat the proposed method outperforms previous state-of-the-art models on the\nOVD datasets.\n","authors":["Hojun Choi","Junsuk Choe","Hyunjung Shim"],"pdf_url":"https://arxiv.org/pdf/2412.18273v1.pdf","comment":"19 pages"},{"id":"http://arxiv.org/abs/2410.01020v2","updated":"2024-12-24T08:27:09Z","published":"2024-10-01T19:28:45Z","title":"A Critical Assessment of Visual Sound Source Localization Models\n Including Negative Audio","summary":" The task of Visual Sound Source Localization (VSSL) involves identifying the\nlocation of sound sources in visual scenes, integrating audio-visual data for\nenhanced scene understanding. Despite advancements in state-of-the-art (SOTA)\nmodels, we observe three critical flaws: i) The evaluation of the models is\nmainly focused in sounds produced by objects that are visible in the image, ii)\nThe evaluation often assumes a prior knowledge of the size of the sounding\nobject, and iii) No universal threshold for localization in real-world\nscenarios is established, as previous approaches only consider positive\nexamples without accounting for both positive and negative cases. In this\npaper, we introduce a novel test set and metrics designed to complete the\ncurrent standard evaluation of VSSL models by testing them in scenarios where\nnone of the objects in the image corresponds to the audio input, i.e. a\nnegative audio. We consider three types of negative audio: silence, noise and\noffscreen. Our analysis reveals that numerous SOTA models fail to appropriately\nadjust their predictions based on audio input, suggesting that these models may\nnot be leveraging audio information as intended. 
Additionally, we provide a\ncomprehensive analysis of the range of maximum values in the estimated\naudio-visual similarity maps, in both positive and negative audio cases, and\nshow that most of the models are not discriminative enough, making them unfit\nto choose a universal threshold appropriate to perform sound localization\nwithout any a priori information of the sounding object, that is, object size\nand visibility.\n","authors":["Xavier Juanola","Gloria Haro","Magdalena Fuentes"],"pdf_url":"https://arxiv.org/pdf/2410.01020v2.pdf","comment":"Accepted in ICASSP 2025"},{"id":"http://arxiv.org/abs/2409.08772v2","updated":"2024-12-24T08:18:25Z","published":"2024-09-13T12:30:15Z","title":"The Practice of Averaging Rate-Distortion Curves over Testsets to\n Compare Learned Video Codecs Can Cause Misleading Conclusions","summary":" This paper aims to demonstrate how the prevalent practice in the learned\nvideo compression community of averaging rate-distortion (RD) curves across a\ntest video set can lead to misleading conclusions in evaluating codec\nperformance. Through analytical analysis of a simple case and experimental\nresults with two recent learned video codecs, we show how averaged RD curves\ncan mislead comparative evaluation of different codecs, particularly when\nvideos in a dataset have varying characteristics and operating ranges. We\nillustrate how a single video with distinct RD characteristics from the rest of\nthe test set can disproportionately influence the average RD curve, potentially\novershadowing a codec's superior performance across most individual sequences.\nUsing two recent learned video codecs on the UVG dataset as a case study, we\ndemonstrate computing performance metrics, such as the BD rate, from the\naverage RD curve suggests conclusions that contradict those reached from\ncalculating the average of per-sequence metrics. 
Hence, we argue that the\nlearned video compression community should also report per-sequence RD curves\nand performance metrics for a test set should be computed from the average of\nper-sequence metrics, similar to the established practice in traditional video\ncoding, to ensure fair and accurate codec comparisons.\n","authors":["M. Akin Yilmaz","Onur Keleş","A. Murat Tekalp"],"pdf_url":"https://arxiv.org/pdf/2409.08772v2.pdf","comment":"Submitted to IEEE Signal Processing Letters"},{"id":"http://arxiv.org/abs/2407.03771v4","updated":"2024-12-24T08:15:48Z","published":"2024-07-04T09:32:12Z","title":"SpikeGS: Reconstruct 3D scene via fast-moving bio-inspired sensors","summary":" 3D Gaussian Splatting (3DGS) demonstrates unparalleled superior performance\nin 3D scene reconstruction. However, 3DGS heavily relies on the sharp images.\nFulfilling this requirement can be challenging in real-world scenarios\nespecially when the camera moves fast, which severely limits the application of\n3DGS. To address these challenges, we proposed Spike Gausian Splatting\n(SpikeGS), the first framework that integrates the spike streams into 3DGS\npipeline to reconstruct 3D scenes via a fast-moving bio-inspired camera. With\naccumulation rasterization, interval supervision, and a specially designed\npipeline, SpikeGS extracts detailed geometry and texture from high temporal\nresolution but texture lacking spike stream, reconstructs 3D scenes captured in\n1 second. Extensive experiments on multiple synthetic and real-world datasets\ndemonstrate the superiority of SpikeGS compared with existing spike-based and\ndeblur 3D scene reconstruction methods. 
Codes and data will be released soon.\n","authors":["Yijia Guo","Liwen Hu","Yuanxi Bai","Jiawei Yao","Lei Ma","Tiejun Huang"],"pdf_url":"https://arxiv.org/pdf/2407.03771v4.pdf","comment":"Accepted by AAAI2025"},{"id":"http://arxiv.org/abs/2412.18255v1","updated":"2024-12-24T08:12:31Z","published":"2024-12-24T08:12:31Z","title":"AdaCo: Overcoming Visual Foundation Model Noise in 3D Semantic\n Segmentation via Adaptive Label Correction","summary":" Recently, Visual Foundation Models (VFMs) have shown a remarkable\ngeneralization performance in 3D perception tasks. However, their effectiveness\nin large-scale outdoor datasets remains constrained by the scarcity of accurate\nsupervision signals, the extensive noise caused by variable outdoor conditions,\nand the abundance of unknown objects. In this work, we propose a novel\nlabel-free learning method, Adaptive Label Correction (AdaCo), for 3D semantic\nsegmentation. AdaCo first introduces the Cross-modal Label Generation Module\n(CLGM), providing cross-modal supervision with the formidable interpretive\ncapabilities of the VFMs. Subsequently, AdaCo incorporates the Adaptive Noise\nCorrector (ANC), updating and adjusting the noisy samples within this\nsupervision iteratively during training. Moreover, we develop an Adaptive\nRobust Loss (ARL) function to modulate each sample's sensitivity to noisy\nsupervision, preventing potential underfitting issues associated with robust\nloss. Our proposed AdaCo can effectively mitigate the performance limitations\nof label-free learning networks in 3D semantic segmentation tasks. 
Extensive\nexperiments on two outdoor benchmark datasets highlight the superior\nperformance of our method.\n","authors":["Pufan Zou","Shijia Zhao","Weijie Huang","Qiming Xia","Chenglu Wen","Wei Li","Cheng Wang"],"pdf_url":"https://arxiv.org/pdf/2412.18255v1.pdf","comment":"2025 AAAI"},{"id":"http://arxiv.org/abs/2405.08297v2","updated":"2024-12-24T08:12:26Z","published":"2024-05-14T03:42:33Z","title":"Distance-Restricted Explanations: Theoretical Underpinnings & Efficient\n Implementation","summary":" The uses of machine learning (ML) have snowballed in recent years. In many\ncases, ML models are highly complex, and their operation is beyond the\nunderstanding of human decision-makers. Nevertheless, some uses of ML models\ninvolve high-stakes and safety-critical applications. Explainable artificial\nintelligence (XAI) aims to help human decision-makers in understanding the\noperation of such complex ML models, thus eliciting trust in their operation.\nUnfortunately, the majority of past XAI work is based on informal approaches,\nthat offer no guarantees of rigor. Unsurprisingly, there exists comprehensive\nexperimental and theoretical evidence confirming that informal methods of XAI\ncan provide human-decision makers with erroneous information. Logic-based XAI\nrepresents a rigorous approach to explainability; it is model-based and offers\nthe strongest guarantees of rigor of computed explanations. However, a\nwell-known drawback of logic-based XAI is the complexity of logic reasoning,\nespecially for highly complex ML models. Recent work proposed\ndistance-restricted explanations, i.e. explanations that are rigorous provided\nthe distance to a given input is small enough. Distance-restricted\nexplainability is tightly related with adversarial robustness, and it has been\nshown to scale for moderately complex ML models, but the number of inputs still\nrepresents a key limiting factor. 
This paper investigates novel algorithms for\nscaling up the performance of logic-based explainers when computing and\nenumerating ML model explanations with a large number of inputs.\n","authors":["Yacine Izza","Xuanxiang Huang","Antonio Morgado","Jordi Planes","Alexey Ignatiev","Joao Marques-Silva"],"pdf_url":"https://arxiv.org/pdf/2405.08297v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.18254v1","updated":"2024-12-24T08:08:29Z","published":"2024-12-24T08:08:29Z","title":"RaCMC: Residual-Aware Compensation Network with Multi-Granularity\n Constraints for Fake News Detection","summary":" Multimodal fake news detection aims to automatically identify real or fake\nnews, thereby mitigating the adverse effects caused by such misinformation.\nAlthough prevailing approaches have demonstrated their effectiveness,\nchallenges persist in cross-modal feature fusion and refinement for\nclassification. To address this, we present a residual-aware compensation\nnetwork with multi-granularity constraints (RaCMC) for fake news detection,\nthat aims to sufficiently interact and fuse cross-modal features while\namplifying the differences between real and fake news. First, a multiscale\nresidual-aware compensation module is designed to interact and fuse features at\ndifferent scales, and ensure both the consistency and exclusivity of feature\ninteraction, thus acquiring high-quality features. Second, a multi-granularity\nconstraints module is implemented to limit the distribution of both the news\noverall and the image-text pairs within the news, thus amplifying the\ndifferences between real and fake news at the news and feature levels. Finally,\na dominant feature fusion reasoning module is developed to comprehensively\nevaluate news authenticity from the perspectives of both consistency and\ninconsistency. 
Experiments on three public datasets, including Weibo17,\nPolitifact and GossipCop, reveal the superiority of the proposed method.\n","authors":["Xinquan Yu","Ziqi Sheng","Wei Lu","Xiangyang Luo","Jiantao Zhou"],"pdf_url":"https://arxiv.org/pdf/2412.18254v1.pdf","comment":"9 pages, 4 figures"},{"id":"http://arxiv.org/abs/2412.18249v1","updated":"2024-12-24T08:02:44Z","published":"2024-12-24T08:02:44Z","title":"An Improved Fault Diagnosis Strategy for Induction Motors Using Weighted\n Probability Ensemble Deep Learning","summary":" Early detection of faults in induction motors is crucial for ensuring\nuninterrupted operations in industrial settings. Among the various fault types\nencountered in induction motors, bearing, rotor, and stator faults are the most\nprevalent. This paper introduces a Weighted Probability Ensemble Deep Learning\n(WPEDL) methodology, tailored for effectively diagnosing induction motor faults\nusing high-dimensional data extracted from vibration and current features. The\nShort-Time Fourier Transform (STFT) is employed to extract features from both\nvibration and current signals. The performance of the WPEDL fault diagnosis\nmethod is compared against conventional deep learning models, demonstrating the\nsuperior efficacy of the proposed system. The multi-class fault diagnosis\nsystem based on WPEDL achieves high accuracies across different fault types:\n99.05% for bearing (vibrational signal), 99.10%, and 99.50% for rotor (current\nand vibration signal), and 99.60%, and 99.52% for stator faults (current and\nvibration signal) respectively. To evaluate the robustness of our multi-class\nclassification decisions, tests have been conducted on a combined dataset of\n52,000 STFT images encompassing all three faults. Our proposed model\noutperforms other models, achieving an accuracy of 98.89%. 
The findings\nunderscore the effectiveness and reliability of the WPEDL approach for\nearly-stage fault diagnosis in IMs, offering promising insights for enhancing\nindustrial operational efficiency and reliability.\n","authors":["Usman Ali","Waqas Ali","Umer Ramzan"],"pdf_url":"https://arxiv.org/pdf/2412.18249v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.16576v2","updated":"2024-12-24T07:56:48Z","published":"2024-12-21T10:40:56Z","title":"Open-Vocabulary Mobile Manipulation Based on Double Relaxed Contrastive\n Learning with Dense Labeling","summary":" Growing labor shortages are increasing the demand for domestic service robots\n(DSRs) to assist in various settings. In this study, we develop a DSR that\ntransports everyday objects to specified pieces of furniture based on\nopen-vocabulary instructions. Our approach focuses on retrieving images of\ntarget objects and receptacles from pre-collected images of indoor\nenvironments. For example, given an instruction \"Please get the right red towel\nhanging on the metal towel rack and put it in the white washing machine on the\nleft,\" the DSR is expected to carry the red towel to the washing machine based\non the retrieved images. This is challenging because the correct images should\nbe retrieved from thousands of collected images, which may include many images\nof similar towels and appliances. To address this, we propose RelaX-Former,\nwhich learns diverse and robust representations from among positive, unlabeled\npositive, and negative samples. We evaluated RelaX-Former on a dataset\ncontaining real-world indoor images and human annotated instructions including\ncomplex referring expressions. The experimental results demonstrate that\nRelaX-Former outperformed existing baseline models across standard image\nretrieval metrics. Moreover, we performed physical experiments using a DSR to\nevaluate the performance of our approach in a zero-shot transfer setting. 
The\nexperiments involved the DSR to carry objects to specific receptacles based on\nopen-vocabulary instructions, achieving an overall success rate of 75%.\n","authors":["Daichi Yashima","Ryosuke Korekata","Komei Sugiura"],"pdf_url":"https://arxiv.org/pdf/2412.16576v2.pdf","comment":"Accepted for IEEE RA-L 2025"},{"id":"http://arxiv.org/abs/2412.17387v2","updated":"2024-12-24T07:52:45Z","published":"2024-12-23T08:40:08Z","title":"Singular Value Scaling: Efficient Generative Model Compression via\n Pruned Weights Refinement","summary":" While pruning methods effectively maintain model performance without extra\ntraining costs, they often focus solely on preserving crucial connections,\noverlooking the impact of pruned weights on subsequent fine-tuning or\ndistillation, leading to inefficiencies. Moreover, most compression techniques\nfor generative models have been developed primarily for GANs, tailored to\nspecific architectures like StyleGAN, and research into compressing Diffusion\nmodels has just begun. Even more, these methods are often applicable only to\nGANs or Diffusion models, highlighting the need for approaches that work across\nboth model types. In this paper, we introduce Singular Value Scaling (SVS), a\nversatile technique for refining pruned weights, applicable to both model\ntypes. Our analysis reveals that pruned weights often exhibit dominant singular\nvectors, hindering fine-tuning efficiency and leading to suboptimal performance\ncompared to random initialization. Our method enhances weight initialization by\nminimizing the disparities between singular values of pruned weights, thereby\nimproving the fine-tuning process. This approach not only guides the compressed\nmodel toward superior solutions but also significantly speeds up fine-tuning.\nExtensive experiments on StyleGAN2, StyleGAN3 and DDPM demonstrate that SVS\nimproves compression performance across model types without additional training\ncosts. 
Our code is available at:\nhttps://github.com/LAIT-CVLab/Singular-Value-Scaling.\n","authors":["Hyeonjin Kim","Jaejun Yoo"],"pdf_url":"https://arxiv.org/pdf/2412.17387v2.pdf","comment":"Accepted to AAAI 2025"},{"id":"http://arxiv.org/abs/2404.00717v3","updated":"2024-12-24T07:51:15Z","published":"2024-03-31T15:22:11Z","title":"End-to-End Autonomous Driving through V2X Cooperation","summary":" Cooperatively utilizing both ego-vehicle and infrastructure sensor data via\nV2X communication has emerged as a promising approach for advanced autonomous\ndriving. However, current research mainly focuses on improving individual\nmodules, rather than taking end-to-end learning to optimize final planning\nperformance, resulting in underutilized data potential. In this paper, we\nintroduce UniV2X, a pioneering cooperative autonomous driving framework that\nseamlessly integrates all key driving modules across diverse views into a\nunified network. We propose a sparse-dense hybrid data transmission and fusion\nmechanism for effective vehicle-infrastructure cooperation, offering three\nadvantages: 1) Effective for simultaneously enhancing agent perception, online\nmapping, and occupancy prediction, ultimately improving planning performance.\n2) Transmission-friendly for practical and limited communication conditions. 3)\nReliable data fusion with interpretability of this hybrid data. We implement\nUniV2X, as well as reproducing several benchmark methods, on the challenging\nDAIR-V2X, the real-world cooperative driving dataset. Experimental results\ndemonstrate the effectiveness of UniV2X in significantly enhancing planning\nperformance, as well as all intermediate output performance. The project is\navailable at\n\\href{https://github.com/AIR-THU/UniV2X}{https://github.com/AIR-THU/UniV2X}.\n","authors":["Haibao Yu","Wenxian Yang","Jiaru Zhong","Zhenwei Yang","Siqi Fan","Ping Luo","Zaiqing Nie"],"pdf_url":"https://arxiv.org/pdf/2404.00717v3.pdf","comment":"Accepted by AAAI 2025. 
Add more open-loop evaluation indicators"},{"id":"http://arxiv.org/abs/2412.18235v1","updated":"2024-12-24T07:40:07Z","published":"2024-12-24T07:40:07Z","title":"Band Prompting Aided SAR and Multi-Spectral Data Fusion Framework for\n Local Climate Zone Classification","summary":" Local climate zone (LCZ) classification is of great value for understanding\nthe complex interactions between urban development and local climate. Recent\nstudies have increasingly focused on the fusion of synthetic aperture radar\n(SAR) and multi-spectral data to improve LCZ classification performance.\nHowever, it remains challenging due to the distinct physical properties of\nthese two types of data and the absence of effective fusion guidance. In this\npaper, a novel band prompting aided data fusion framework is proposed for LCZ\nclassification, namely BP-LCZ, which utilizes textual prompts associated with\nband groups to guide the model in learning the physical attributes of different\nbands and semantics of various categories inherent in SAR and multi-spectral\ndata to augment the fused feature, thus enhancing LCZ classification\nperformance. Specifically, a band group prompting (BGP) strategy is introduced\nto align the visual representation effectively at the level of band groups,\nwhich also facilitates a more adequate extraction of semantic information of\ndifferent bands with textual information. In addition, a multivariate\nsupervised matrix (MSM) based training strategy is proposed to alleviate the\nproblem of positive and negative sample confusion by completing the supervised\ninformation. 
The experimental results demonstrate the effectiveness and\nsuperiority of the proposed data fusion framework.\n","authors":["Haiyan Lan","Shujun Li","Mingjie Xie","Xuanjia Zhao","Hongning Liu","Pengming Feng","Dongli Xu","Guangjun He","Jian Guan"],"pdf_url":"https://arxiv.org/pdf/2412.18235v1.pdf","comment":"Accepted by ICASSP 2025"},{"id":"http://arxiv.org/abs/2412.18230v1","updated":"2024-12-24T07:28:10Z","published":"2024-12-24T07:28:10Z","title":"Efficient Detection Framework Adaptation for Edge Computing: A\n Plug-and-play Neural Network Toolbox Enabling Edge Deployment","summary":" Edge computing has emerged as a key paradigm for deploying deep\nlearning-based object detection in time-sensitive scenarios. However, existing\nedge detection methods face challenges: 1) difficulty balancing detection\nprecision with lightweight models, 2) limited adaptability of generalized\ndeployment designs, and 3) insufficient real-world validation. To address these\nissues, we propose the Edge Detection Toolbox (ED-TOOLBOX), which utilizes\ngeneralizable plug-and-play components to adapt object detection models for\nedge environments. Specifically, we introduce a lightweight Reparameterized\nDynamic Convolutional Network (Rep-DConvNet) featuring weighted multi-shape\nconvolutional branches to enhance detection performance. Additionally, we\ndesign a Sparse Cross-Attention (SC-A) network with a\nlocalized-mapping-assisted self-attention mechanism, enabling a well-crafted\njoint module for adaptive feature transfer. For real-world applications, we\nincorporate an Efficient Head into the YOLO framework to accelerate edge model\noptimization. To demonstrate practical impact, we identify a gap in helmet\ndetection -- overlooking band fastening, a critical safety factor -- and create\nthe Helmet Band Detection Dataset (HBDD). Using ED-TOOLBOX-optimized models, we\naddress this real-world task. 
Extensive experiments validate the effectiveness\nof ED-TOOLBOX, with edge detection models outperforming six state-of-the-art\nmethods in visual surveillance simulations, achieving real-time and accurate\nperformance. These results highlight ED-TOOLBOX as a superior solution for edge\nobject detection.\n","authors":["Jiaqi Wu","Shihao Zhang","Simin Chen","Lixu Wang","Zehua Wang","Wei Chen","Fangyuan He","Zijian Tian","F. Richard Yu","Victor C. M. Leung"],"pdf_url":"https://arxiv.org/pdf/2412.18230v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.18224v1","updated":"2024-12-24T07:13:17Z","published":"2024-12-24T07:13:17Z","title":"Expand VSR Benchmark for VLLM to Expertize in Spatial Rules","summary":" Distinguishing spatial relations is a basic part of human cognition which\nrequires fine-grained perception on cross-instance. Although benchmarks like\nMME, MMBench and SEED comprehensively have evaluated various capabilities which\nalready include visual spatial reasoning(VSR). There is still a lack of\nsufficient quantity and quality evaluation and optimization datasets for Vision\nLarge Language Models(VLLMs) specifically targeting visual positional\nreasoning. To handle this, we first diagnosed current VLLMs with the VSR\ndataset and proposed a unified test set. We found current VLLMs to exhibit a\ncontradiction of over-sensitivity to language instructions and\nunder-sensitivity to visual positional information. By expanding the original\nbenchmark from two aspects of tunning data and model structure, we mitigated\nthis phenomenon. To our knowledge, we expanded spatially positioned image data\ncontrollably using diffusion models for the first time and integrated original\nvisual encoding(CLIP) with other 3 powerful visual encoders(SigLIP, SAM and\nDINO). 
After conducting combination experiments on scaling data and models, we\nobtained a VLLM VSR Expert(VSRE) that not only generalizes better to different\ninstructions but also accurately distinguishes differences in visual positional\ninformation. VSRE achieved over a 27\\% increase in accuracy on the VSR test\nset. It becomes a performant VLLM on the position reasoning of both the VSR\ndataset and relevant subsets of other evaluation benchmarks. We open-sourced\nthe expanded model with data and Appendix at\n\\url{https://github.com/peijin360/vsre} and hope it will accelerate\nadvancements in VLLM on VSR learning.\n","authors":["Peijin Xie","Lin Sun","Bingquan Liu","Dexin Wang","Xiangzheng Zhang","Chengjie Sun","Jiajia Zhang"],"pdf_url":"https://arxiv.org/pdf/2412.18224v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.18221v1","updated":"2024-12-24T07:05:55Z","published":"2024-12-24T07:05:55Z","title":"GIMS: Image Matching System Based on Adaptive Graph Construction and\n Graph Neural Network","summary":" Feature-based image matching has extensive applications in computer vision.\nKeypoints detected in images can be naturally represented as graph structures,\nand Graph Neural Networks (GNNs) have been shown to outperform traditional deep\nlearning techniques. Consequently, the paradigm of image matching via GNNs has\ngained significant prominence in recent academic research. In this paper, we\nfirst introduce an innovative adaptive graph construction method that utilizes\na filtering mechanism based on distance and dynamic threshold similarity. This\nmethod dynamically adjusts the criteria for incorporating new vertices based on\nthe characteristics of existing vertices, allowing for the construction of more\nprecise and robust graph structures while avoiding redundancy. 
We further\ncombine the vertex processing capabilities of GNNs with the global awareness\ncapabilities of Transformers to enhance the model's representation of spatial\nand feature information within graph structures. This hybrid model provides a\ndeeper understanding of the interrelationships between vertices and their\ncontributions to the matching process. Additionally, we employ the Sinkhorn\nalgorithm to iteratively solve for optimal matching results. Finally, we\nvalidate our system using extensive image datasets and conduct comprehensive\ncomparative experiments. Experimental results demonstrate that our system\nachieves an average improvement of 3.8x-40.3x in overall matching performance.\nAdditionally, the number of vertices and edges significantly impacts training\nefficiency and memory usage; therefore, we employ multi-GPU technology to\naccelerate the training process. Our code is available at\nhttps://github.com/songxf1024/GIMS.\n","authors":["Xianfeng Song","Yi Zou","Zheng Shi","Zheng Liu"],"pdf_url":"https://arxiv.org/pdf/2412.18221v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.17155v2","updated":"2024-12-24T07:01:36Z","published":"2024-12-22T20:33:59Z","title":"The Potential of Convolutional Neural Networks for Cancer Detection","summary":" Early detection of cancer is critical in improving treatment outcomes and\nincreasing survival rates, particularly for common cancers such as lung,\nbreast, and prostate which collectively contribute to a significant global\nmortality burden. With advancements in imaging technologies and data\nprocessing, Convolutional Neural Networks (CNNs) have emerged as a powerful\ntool for analyzing and classifying medical images, enabling more precise cancer\ndetection. This paper provides a comprehensive review of recent studies\nleveraging CNN models for detecting ten different types of cancer. 
Each study\nemploys distinct CNN architectures to identify patterns associated with these\ncancers, utilizing diverse datasets. Key differences and strengths of these\narchitectures are meticulously compared and analyzed, highlighting their\nefficacy in improving early detection. Beyond reviewing the performance and\nlimitations of CNN-based cancer detection methods, this study explores the\nfeasibility of integrating CNNs into clinical settings as an early detection\ntool, potentially complementing or replacing traditional methods. Despite\nsignificant progress, challenges remain, including data diversity, result\ninterpretation, and ethical considerations. By identifying the best-performing\nCNN architectures and providing a comparative analysis, this study aims to\ncontribute a comprehensive perspective on the application of CNNs in cancer\ndetection and their role in advancing diagnostic capabilities in healthcare.\n","authors":["Hossein Molaeian","Kaveh Karamjani","Sina Teimouri","Saeed Roshani","Sobhan Roshani"],"pdf_url":"https://arxiv.org/pdf/2412.17155v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.18219v1","updated":"2024-12-24T06:57:16Z","published":"2024-12-24T06:57:16Z","title":"Adapter Merging with Centroid Prototype Mapping for Scalable\n Class-Incremental Learning","summary":" We propose Adapter Merging with Centroid Prototype Mapping (ACMap), an\nexemplar-free framework for class-incremental learning (CIL) that addresses\nboth catastrophic forgetting and scalability. While existing methods trade-off\nbetween inference time and accuracy, ACMap consolidates task-specific adapters\ninto a single adapter, ensuring constant inference time across tasks without\ncompromising accuracy. The framework employs adapter merging to build a shared\nsubspace that aligns task representations and mitigates forgetting, while\ncentroid prototype mapping maintains high accuracy through consistent\nadaptation in the shared subspace. 
To further improve scalability, an early\nstopping strategy limits adapter merging as tasks increase. Extensive\nexperiments on five benchmark datasets demonstrate that ACMap matches\nstate-of-the-art accuracy while maintaining inference time comparable to the\nfastest existing methods. The code is available at\nhttps://github.com/tf63/ACMap\n","authors":["Takuma Fukuda","Hiroshi Kera","Kazuhiko Kawamoto"],"pdf_url":"https://arxiv.org/pdf/2412.18219v1.pdf","comment":"11 pages (main text), 6 pages (supplementary material)"},{"id":"http://arxiv.org/abs/2412.18216v1","updated":"2024-12-24T06:45:36Z","published":"2024-12-24T06:45:36Z","title":"ICM-Assistant: Instruction-tuning Multimodal Large Language Models for\n Rule-based Explainable Image Content Moderation","summary":" Controversial contents largely inundate the Internet, infringing various\ncultural norms and child protection standards. Traditional Image Content\nModeration (ICM) models fall short in producing precise moderation decisions\nfor diverse standards, while recent multimodal large language models (MLLMs),\nwhen adopted to general rule-based ICM, often produce classification and\nexplanation results that are inconsistent with human moderators. Aiming at\nflexible, explainable, and accurate ICM, we design a novel rule-based dataset\ngeneration pipeline, decomposing concise human-defined rules and leveraging\nwell-designed multi-stage prompts to enrich short explicit image annotations.\nOur ICM-Instruct dataset includes detailed moderation explanation and\nmoderation Q-A pairs. Built upon it, we create our ICM-Assistant model in the\nframework of rule-based ICM, making it readily applicable in real practice. 
Our\nICM-Assistant model demonstrates exceptional performance and flexibility.\nSpecifically, it significantly outperforms existing approaches on various\nsources, improving both the moderation classification (36.8\\% on average) and\nmoderation explanation quality (26.6\\% on average) consistently over existing\nMLLMs. Code/Data is available at https://github.com/zhaoyuzhi/ICM-Assistant.\n","authors":["Mengyang Wu","Yuzhi Zhao","Jialun Cao","Mingjie Xu","Zhongming Jiang","Xuehui Wang","Qinbin Li","Guangneng Hu","Shengchao Qin","Chi-Wing Fu"],"pdf_url":"https://arxiv.org/pdf/2412.18216v1.pdf","comment":"AAAI 2025"},{"id":"http://arxiv.org/abs/2412.18214v1","updated":"2024-12-24T06:43:27Z","published":"2024-12-24T06:43:27Z","title":"SDM-Car: A Dataset for Small and Dim Moving Vehicles Detection in\n Satellite Videos","summary":" Vehicle detection and tracking in satellite video is essential in remote\nsensing (RS) applications. However, upon the statistical analysis of existing\ndatasets, we find that the dim vehicles with low radiation intensity and\nlimited contrast against the background are rarely annotated, which leads to\nthe poor effect of existing approaches in detecting moving vehicles under low\nradiation conditions. In this paper, we address the challenge by building a\n\\textbf{S}mall and \\textbf{D}im \\textbf{M}oving Cars (SDM-Car) dataset with a\nmultitude of annotations for dim vehicles in satellite videos, which is\ncollected by the Luojia 3-01 satellite and comprises 99 high-quality videos.\nFurthermore, we propose a method based on image enhancement and attention\nmechanisms to improve the detection accuracy of dim vehicles, serving as a\nbenchmark for evaluating the dataset. Finally, we assess the performance of\nseveral representative methods on SDM-Car and present insightful findings. 
The\ndataset is openly available at https://github.com/TanedaM/SDM-Car.\n","authors":["Zhen Zhang","Tao Peng","Liang Liao","Jing Xiao","Mi Wang"],"pdf_url":"https://arxiv.org/pdf/2412.18214v1.pdf","comment":"5 pages, 7 figures, IEEE Geoscience and Remote Sensing Letters"},{"id":"http://arxiv.org/abs/2310.02692v3","updated":"2024-12-24T06:40:03Z","published":"2023-10-04T10:03:07Z","title":"Clustering-based Image-Text Graph Matching for Domain Generalization","summary":" Learning domain-invariant visual representations is important to train a\nmodel that can generalize well to unseen target task domains. Recent works\ndemonstrate that text descriptions contain high-level class-discriminative\ninformation and such auxiliary semantic cues can be used as effective pivot\nembedding for domain generalization problems. However, they use pivot embedding\nin a global manner (i.e., aligning an image embedding with sentence-level text\nembedding), which does not fully utilize the semantic cues of given text\ndescription. In this work, we advocate for the use of local alignment between\nimage regions and corresponding textual descriptions to get domain-invariant\nfeatures. To this end, we first represent image and text inputs as graphs. We\nthen cluster nodes within these graphs and match the graph-based image node\nfeatures to the nodes of textual graphs. This matching process is conducted\nboth globally and locally, tightly aligning visual and textual semantic\nsub-structures. We experiment with large-scale public datasets, such as CUB-DG\nand DomainBed, and our model achieves matched or better state-of-the-art\nperformance on these datasets. 
The code is available at:\nhttps://github.com/noparkee/Graph-Clustering-based-DG\n","authors":["Nokyung Park","Daewon Chae","Jeongyong Shim","Sangpil Kim","Eun-Sol Kim","Jinkyu Kim"],"pdf_url":"https://arxiv.org/pdf/2310.02692v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.18204v1","updated":"2024-12-24T06:20:01Z","published":"2024-12-24T06:20:01Z","title":"BoxMAC -- A Boxing Dataset for Multi-label Action Classification","summary":" In competitive combat sports like boxing, analyzing a boxers's performance\nstatics is crucial for evaluating the quantity and variety of punches delivered\nduring bouts. These statistics provide valuable data and feedback, which are\nroutinely used for coaching and performance enhancement. We introduce BoxMAC, a\nreal-world boxing dataset featuring 15 professional boxers and encompassing 13\ndistinct action labels. Comprising over 60,000 frames, our dataset has been\nmeticulously annotated for multiple actions per frame with inputs from a boxing\ncoach. Since two boxers can execute different punches within a single\ntimestamp, this problem falls under the domain of multi-label action\nclassification. We propose a novel architecture for jointly recognizing\nmultiple actions in both individual images and videos. We investigate baselines\nusing deep neural network architectures to address both tasks. We believe that\nBoxMAC will enable researchers and practitioners to develop and evaluate more\nefficient models for performance analysis. 
With its realistic and diverse\nnature, BoxMAC can serve as a valuable resource for the advancement of boxing\nas a sport\n","authors":["Shashikanta Sahoo"],"pdf_url":"https://arxiv.org/pdf/2412.18204v1.pdf","comment":"10 pages, 8 figures"},{"id":"http://arxiv.org/abs/2412.18199v1","updated":"2024-12-24T06:09:33Z","published":"2024-12-24T06:09:33Z","title":"Leveraging Deep Learning with Multi-Head Attention for Accurate\n Extraction of Medicine from Handwritten Prescriptions","summary":" Extracting medication names from handwritten doctor prescriptions is\nchallenging due to the wide variability in handwriting styles and prescription\nformats. This paper presents a robust method for extracting medicine names\nusing a combination of Mask R-CNN and Transformer-based Optical Character\nRecognition (TrOCR) with Multi-Head Attention and Positional Embeddings. A\nnovel dataset, featuring diverse handwritten prescriptions from various regions\nof Pakistan, was utilized to fine-tune the model on different handwriting\nstyles. The Mask R-CNN model segments the prescription images to focus on the\nmedicinal sections, while the TrOCR model, enhanced by Multi-Head Attention and\nPositional Embeddings, transcribes the isolated text. The transcribed text is\nthen matched against a pre-existing database for accurate identification. 
The\nproposed approach achieved a character error rate (CER) of 1.4% on standard\nbenchmarks, highlighting its potential as a reliable and efficient tool for\nautomating medicine name extraction.\n","authors":["Usman Ali","Sahil Ranmbail","Muhammad Nadeem","Hamid Ishfaq","Muhammad Umer Ramzan","Waqas Ali"],"pdf_url":"https://arxiv.org/pdf/2412.18199v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.18194v1","updated":"2024-12-24T06:03:42Z","published":"2024-12-24T06:03:42Z","title":"VLABench: A Large-Scale Benchmark for Language-Conditioned Robotics\n Manipulation with Long-Horizon Reasoning Tasks","summary":" General-purposed embodied agents are designed to understand the users'\nnatural instructions or intentions and act precisely to complete universal\ntasks. Recently, methods based on foundation models especially\nVision-Language-Action models (VLAs) have shown a substantial potential to\nsolve language-conditioned manipulation (LCM) tasks well. However, existing\nbenchmarks do not adequately meet the needs of VLAs and relative algorithms. To\nbetter define such general-purpose tasks in the context of LLMs and advance the\nresearch in VLAs, we present VLABench, an open-source benchmark for evaluating\nuniversal LCM task learning. VLABench provides 100 carefully designed\ncategories of tasks, with strong randomization in each category of task and a\ntotal of 2000+ objects. VLABench stands out from previous benchmarks in four\nkey aspects: 1) tasks requiring world knowledge and common sense transfer, 2)\nnatural language instructions with implicit human intentions rather than\ntemplates, 3) long-horizon tasks demanding multi-step reasoning, and 4)\nevaluation of both action policies and language model capabilities. The\nbenchmark assesses multiple competencies including understanding of\nmesh\\&texture, spatial relationship, semantic instruction, physical laws,\nknowledge transfer and reasoning, etc. 
To support the downstream finetuning, we\nprovide high-quality training data collected via an automated framework\nincorporating heuristic skills and prior information. The experimental results\nindicate that both the current state-of-the-art pretrained VLAs and the\nworkflow based on VLMs face challenges in our tasks.\n","authors":["Shiduo Zhang","Zhe Xu","Peiju Liu","Xiaopeng Yu","Yuan Li","Qinghui Gao","Zhaoye Fei","Zhangyue Yin","Zuxuan Wu","Yu-Gang Jiang","Xipeng Qiu"],"pdf_url":"https://arxiv.org/pdf/2412.18194v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2210.01708v5","updated":"2024-12-24T05:57:39Z","published":"2022-10-04T16:08:54Z","title":"Exploring Parameter-Efficient Fine-Tuning to Enable Foundation Models in\n Federated Learning","summary":" Federated learning (FL) has emerged as a promising paradigm for enabling the\ncollaborative training of models without centralized access to the raw data on\nlocal devices. In the typical FL paradigm (e.g., FedAvg), model weights are\nsent to and from the server each round to participating clients. Recently, the\nuse of small pre-trained models has been shown to be effective in federated\nlearning optimization and improving convergence. However, recent\nstate-of-the-art pre-trained models are getting more capable but also have more\nparameters, known as the \"Foundation Models.\" In conventional FL, sharing the\nenormous model weights can quickly put a massive communication burden on the\nsystem, especially if more capable models are employed. Can we find a solution\nto enable those strong and readily available pre-trained models in FL to\nachieve excellent performance while simultaneously reducing the communication\nburden? 
To this end, we investigate the use of parameter-efficient fine-tuning\nin federated learning and thus introduce a new framework: FedPEFT.\nSpecifically, we systemically evaluate the performance of FedPEFT across a\nvariety of client stability, data distribution, and differential privacy\nsettings. By only locally tuning and globally sharing a small portion of the\nmodel weights, significant reductions in the total communication overhead can\nbe achieved while maintaining competitive or even better performance in a wide\nrange of federated learning scenarios, providing insight into a new paradigm\nfor practical and effective federated systems.\n","authors":["Guangyu Sun","Umar Khalid","Matias Mendieta","Pu Wang","Chen Chen"],"pdf_url":"https://arxiv.org/pdf/2210.01708v5.pdf","comment":"Published in 2024 IEEE International Conference on Big Data"},{"id":"http://arxiv.org/abs/2412.18185v1","updated":"2024-12-24T05:38:45Z","published":"2024-12-24T05:38:45Z","title":"TextMatch: Enhancing Image-Text Consistency Through Multimodal\n Optimization","summary":" Text-to-image generative models excel in creating images from text but\nstruggle with ensuring alignment and consistency between outputs and prompts.\nThis paper introduces TextMatch, a novel framework that leverages multimodal\noptimization to address image-text discrepancies in text-to-image (T2I)\ngeneration and editing. TextMatch employs a scoring strategy powered by large\nlanguage models (LLMs) and visual question-answering (VQA) models to evaluate\nsemantic consistency between prompts and generated images. By integrating\nmultimodal in-context learning and chain of thought reasoning, our method\ndynamically refines prompts through iterative optimization. This process\nensures that the generated images better capture user intent of, resulting in\nhigher fidelity and relevance. 
Extensive experiments demonstrate that TextMatch\nsignificantly improves text-image consistency across multiple benchmarks,\nestablishing a reliable framework for advancing the capabilities of\ntext-to-image generative models. Our code is available at\nhttps://anonymous.4open.science/r/TextMatch-F55C/.\n","authors":["Yucong Luo","Mingyue Cheng","Jie Ouyang","Xiaoyu Tao","Qi Liu"],"pdf_url":"https://arxiv.org/pdf/2412.18185v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.18178v1","updated":"2024-12-24T05:27:11Z","published":"2024-12-24T05:27:11Z","title":"VisionGRU: A Linear-Complexity RNN Model for Efficient Image Analysis","summary":" Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) are two\ndominant models for image analysis. While CNNs excel at extracting multi-scale\nfeatures and ViTs effectively capture global dependencies, both suffer from\nhigh computational costs, particularly when processing high-resolution images.\nRecently, state-space models (SSMs) and recurrent neural networks (RNNs) have\nattracted attention due to their efficiency. However, their performance in\nimage classification tasks remains limited. To address these challenges, this\npaper introduces VisionGRU, a novel RNN-based architecture designed for\nefficient image classification. VisionGRU leverages a simplified Gated\nRecurrent Unit (minGRU) to process large-scale image features with linear\ncomplexity. It divides images into smaller patches and progressively reduces\nthe sequence length while increasing the channel depth, thus facilitating\nmulti-scale feature extraction. 
A hierarchical 2DGRU module with bidirectional\nscanning captures both local and global contexts, improving long-range\ndependency modeling, particularly for tasks like semantic segmentation.\nExperimental results on the ImageNet and ADE20K datasets demonstrate that\nVisionGRU outperforms ViTs, significantly reducing memory usage and\ncomputational costs, especially for high-resolution images. These findings\nunderscore the potential of RNN-based approaches for developing efficient and\nscalable computer vision solutions. Codes will be available at\nhttps://github.com/YangLiu9208/VisionGRU.\n","authors":["Shicheng Yin","Kaixuan Yin","Weixing Chen","Enbo Huang","Yang Liu"],"pdf_url":"https://arxiv.org/pdf/2412.18178v1.pdf","comment":"Codes will be available at https://github.com/YangLiu9208/VisionGRU"},{"id":"http://arxiv.org/abs/2412.18177v1","updated":"2024-12-24T05:25:21Z","published":"2024-12-24T05:25:21Z","title":"Enhancing Online Continual Learning with Plug-and-Play State Space Model\n and Class-Conditional Mixture of Discretization","summary":" Online continual learning (OCL) seeks to learn new tasks from data streams\nthat appear only once, while retaining knowledge of previously learned tasks.\nMost existing methods rely on replay, focusing on enhancing memory retention\nthrough regularization or distillation. However, they often overlook the\nadaptability of the model, limiting the ability to learn generalizable and\ndiscriminative features incrementally from online training data. To address\nthis, we introduce a plug-and-play module, S6MOD, which can be integrated into\nmost existing methods and directly improve adaptability. Specifically, S6MOD\nintroduces an extra branch after the backbone, where a mixture of\ndiscretization selectively adjusts parameters in a selective state space model,\nenriching selective scan patterns such that the model can adaptively select the\nmost sensitive discretization method for current dynamics. 
We further design a\nclass-conditional routing algorithm for dynamic, uncertainty-based adjustment\nand implement a contrastive discretization loss to optimize it. Extensive\nexperiments combining our module with various models demonstrate that S6MOD\nsignificantly enhances model adaptability, leading to substantial performance\ngains and achieving the state-of-the-art results.\n","authors":["Sihao Liu","Yibo Yang","Xiaojie Li","David A. Clifton","Bernard Ghanem"],"pdf_url":"https://arxiv.org/pdf/2412.18177v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2410.20898v2","updated":"2024-12-24T05:22:40Z","published":"2024-10-28T10:26:19Z","title":"Diff-Instruct*: Towards Human-Preferred One-step Text-to-image\n Generative Models","summary":" In this paper, we introduce the Diff-Instruct* (DI*), an image data-free\napproach for building one-step text-to-image generative models that align with\nhuman preference while maintaining the ability to generate highly realistic\nimages. We frame human preference alignment as online reinforcement learning\nusing human feedback (RLHF), where the goal is to maximize the reward function\nwhile regularizing the generator distribution to remain close to a reference\ndiffusion process. Unlike traditional RLHF approaches, which rely on the KL\ndivergence for regularization, we introduce a novel score-based divergence\nregularization, which leads to significantly better performances. Although the\ndirect calculation of this preference alignment objective remains intractable,\nwe demonstrate that we can efficiently compute its gradient by deriving an\nequivalent yet tractable loss function. Remarkably, we used Diff-Instruct* to\ntrain a Stable Diffusion-XL-based 1-step model, the 2.6B DI*-SDXL-1step\ntext-to-image model, which can generate images of a resolution of 1024x1024\nwith only 1 generation step. 
DI*-SDXL-1step model uses only 1.88% inference\ntime and 29.30% GPU memory cost to outperform 12B FLUX-dev-50step significantly\nin PickScore, ImageReward, and CLIPScore on Parti prompt benchmark and HPSv2.1\non Human Preference Score benchmark, establishing a new state-of-the-art\nbenchmark of human-preferred 1-step text-to-image generative models. Besides\nthe strong quantitative performances, extensive qualitative comparisons also\nconfirm the advantages of DI* in terms of maintaining diversity, improving\nimage layouts, and enhancing aesthetic colors. We have released our\nindustry-ready model on the homepage:\n\\url{https://github.com/pkulwj1994/diff_instruct_star}.\n","authors":["Weijian Luo","Colin Zhang","Debing Zhang","Zhengyang Geng"],"pdf_url":"https://arxiv.org/pdf/2410.20898v2.pdf","comment":"revision: 2.6B 1-step text-to-image model outperforms 12B\n Flux-dev-50step model in human preferences"},{"id":"http://arxiv.org/abs/2411.14868v4","updated":"2024-12-24T05:10:03Z","published":"2024-11-22T11:34:18Z","title":"Defective Edge Detection Using Cascaded Ensemble Canny Operator","summary":" Edge detection has been one of the most difficult challenges in computer\nvision because of the difficulty in identifying the borders and edges from the\nreal-world images including objects of varying kinds and sizes. Methods based\non ensemble learning, which use a combination of backbones and attention\nmodules, outperformed more conventional approaches, such as Sobel and Canny\nedge detection. Nevertheless, these algorithms are still challenged when faced\nwith complicated scene photos. In addition, the identified edges utilizing the\ncurrent methods are not refined and often include incorrect edges. In this\nwork, we used a Cascaded Ensemble Canny operator to solve these problems and\ndetect the object edges. The most difficult Fresh and Rotten and Berkeley\ndatasets are used to test the suggested approach in Python. 
In terms of\nperformance metrics and output picture quality, the acquired results outperform\nthe specified edge detection networks\n","authors":["Anjali Nambiyar Rajkumar Kannan"],"pdf_url":"https://arxiv.org/pdf/2411.14868v4.pdf","comment":"2 Pages and 2 Figures"},{"id":"http://arxiv.org/abs/2410.14919v4","updated":"2024-12-24T05:06:20Z","published":"2024-10-19T00:33:51Z","title":"Adversarial Score identity Distillation: Rapidly Surpassing the Teacher\n in One Step","summary":" Score identity Distillation (SiD) is a data-free method that has achieved\nSOTA performance in image generation by leveraging only a pretrained diffusion\nmodel, without requiring any training data. However, its ultimate performance\nis constrained by how accurate the pretrained model captures the true data\nscores at different stages of the diffusion process. In this paper, we\nintroduce SiDA (SiD with Adversarial Loss), which not only enhances generation\nquality but also improves distillation efficiency by incorporating real images\nand adversarial loss. SiDA utilizes the encoder from the generator's score\nnetwork as a discriminator, allowing it to distinguish between real images and\nthose generated by SiD. The adversarial loss is batch-normalized within each\nGPU and then combined with the original SiD loss. This integration effectively\nincorporates the average \"fakeness\" per GPU batch into the pixel-based SiD\nloss, enabling SiDA to distill a single-step generator. SiDA converges\nsignificantly faster than its predecessor when distilled from scratch, and\nswiftly improves upon the original model's performance during fine-tuning from\na pre-distilled SiD generator. This one-step adversarial distillation method\nestablishes new benchmarks in generation performance when distilling EDM\ndiffusion models, achieving FID scores of 1.110 on ImageNet 64x64. 
When\ndistilling EDM2 models trained on ImageNet 512x512, our SiDA method surpasses\neven the largest teacher model, EDM2-XXL, which achieved an FID of 1.81 using\nclassifier-free guidance (CFG) and 63 generation steps. In contrast, SiDA\nachieves FID scores of 2.156 for size XS, 1.669 for S, 1.488 for M, 1.413 for\nL, 1.379 for XL, and 1.366 for XXL, all without CFG and in a single generation\nstep. These results highlight substantial improvements across all model sizes.\nOur code is available at https://github.com/mingyuanzhou/SiD/tree/sida.\n","authors":["Mingyuan Zhou","Huangjie Zheng","Yi Gu","Zhendong Wang","Hai Huang"],"pdf_url":"https://arxiv.org/pdf/2410.14919v4.pdf","comment":"10 pages (main text), 34 figures, and 10 tables"},{"id":"http://arxiv.org/abs/2412.18165v1","updated":"2024-12-24T04:56:32Z","published":"2024-12-24T04:56:32Z","title":"Parallel Neural Computing for Scene Understanding from LiDAR Perception\n in Autonomous Racing","summary":" Autonomous driving in high-speed racing, as opposed to urban environments,\npresents significant challenges in scene understanding due to rapid changes in\nthe track environment. Traditional sequential network approaches may struggle\nto meet the real-time knowledge and decision-making demands of an autonomous\nagent covering large displacements in a short time. This paper proposes a novel\nbaseline architecture for developing sophisticated models capable of true\nhardware-enabled parallelism, achieving neural processing speeds that mirror\nthe agent's high velocity. The proposed model (Parallel Perception Network\n(PPN)) consists of two independent neural networks, segmentation and\nreconstruction networks, running parallelly on separate accelerated hardware.\nThe model takes raw 3D point cloud data from the LiDAR sensor as input and\nconverts it into a 2D Bird's Eye View Map on both devices. 
Each network\nindependently extracts its input features along space and time dimensions and\nproduces outputs parallelly. The proposed method's model is trained on a system\nwith two NVIDIA T4 GPUs, using a combination of loss functions, including edge\npreservation, and demonstrates a 2x speedup in model inference time compared to\na sequential configuration. Implementation is available at:\nhttps://github.com/suwesh/Parallel-Perception-Network. Learned parameters of\nthe trained networks are provided at:\nhttps://huggingface.co/suwesh/ParallelPerceptionNetwork.\n","authors":["Suwesh Prasad Sah"],"pdf_url":"https://arxiv.org/pdf/2412.18165v1.pdf","comment":"IEEE/ISED 2024"},{"id":"http://arxiv.org/abs/2407.02068v4","updated":"2024-12-24T04:45:51Z","published":"2024-07-02T08:58:19Z","title":"LPViT: Low-Power Semi-structured Pruning for Vision Transformers","summary":" Vision transformers have emerged as a promising alternative to convolutional\nneural networks for various image analysis tasks, offering comparable or\nsuperior performance. However, one significant drawback of ViTs is their\nresource-intensive nature, leading to increased memory footprint, computation\ncomplexity, and power consumption. To democratize this high-performance\ntechnology and make it more environmentally friendly, it is essential to\ncompress ViT models, reducing their resource requirements while maintaining\nhigh performance. In this paper, we introduce a new block-structured pruning to\naddress the resource-intensive issue for ViTs, offering a balanced trade-off\nbetween accuracy and hardware acceleration. 
Unlike unstructured pruning or\nchannel-wise structured pruning, block pruning leverages the block-wise\nstructure of linear layers, resulting in more efficient matrix multiplications.\nTo optimize this pruning scheme, our paper proposes a novel hardware-aware\nlearning objective that simultaneously maximizes speedup and minimizes power\nconsumption during inference, tailored to the block sparsity structure. This\nobjective eliminates the need for empirical look-up tables and focuses solely\non reducing parametrized layer connections. Moreover, our paper provides a\nlightweight algorithm to achieve post-training pruning for ViTs, utilizing\nsecond-order Taylor approximation and empirical optimization to solve the\nproposed hardware-aware objective. Extensive experiments on ImageNet are\nconducted across various ViT architectures, including DeiT-B and DeiT-S,\ndemonstrating competitive performance with other pruning methods and achieving\na remarkable balance between accuracy preservation and power savings.\nEspecially, we achieve up to 3.93x and 1.79x speedups on dedicated hardware and\nGPUs respectively for DeiT-B, and also observe an inference power reduction by\n1.4x on real-world GPUs.\n","authors":["Kaixin Xu","Zhe Wang","Chunyun Chen","Xue Geng","Jie Lin","Mohamed M. Sabry Aly","Xulei Yang","Min Wu","Xiaoli Li","Weisi Lin"],"pdf_url":"https://arxiv.org/pdf/2407.02068v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.01870v2","updated":"2024-12-24T04:44:24Z","published":"2024-11-04T07:57:44Z","title":"Mining and Transferring Feature-Geometry Coherence for Unsupervised\n Point Cloud Registration","summary":" Point cloud registration, a fundamental task in 3D vision, has achieved\nremarkable success with learning-based methods in outdoor environments.\nUnsupervised outdoor point cloud registration methods have recently emerged to\ncircumvent the need for costly pose annotations. 
However, they fail to\nestablish reliable optimization objectives for unsupervised training, either\nrelying on overly strong geometric assumptions, or suffering from poor-quality\npseudo-labels due to inadequate integration of low-level geometric and\nhigh-level contextual information. We have observed that in the feature space,\nlatent new inlier correspondences tend to cluster around respective positive\nanchors that summarize features of existing inliers. Motivated by this\nobservation, we propose a novel unsupervised registration method termed INTEGER\nto incorporate high-level contextual information for reliable pseudo-label\nmining. Specifically, we propose the Feature-Geometry Coherence Mining module\nto dynamically adapt the teacher for each mini-batch of data during training\nand discover reliable pseudo-labels by considering both high-level feature\nrepresentations and low-level geometric cues. Furthermore, we propose\nAnchor-Based Contrastive Learning to facilitate contrastive learning with\nanchors for a robust feature space. Lastly, we introduce a Mixed-Density\nStudent to learn density-invariant features, addressing challenges related to\ndensity variation and low overlap in the outdoor scenario. Extensive\nexperiments on KITTI and nuScenes datasets demonstrate that our INTEGER\nachieves competitive performance in terms of accuracy and generalizability.\n","authors":["Kezheng Xiong","Haoen Xiang","Qingshan Xu","Chenglu Wen","Siqi Shen","Jonathan Li","Cheng Wang"],"pdf_url":"https://arxiv.org/pdf/2411.01870v2.pdf","comment":"Accepted by NeurIPS2024"},{"id":"http://arxiv.org/abs/2412.14058v3","updated":"2024-12-24T04:43:45Z","published":"2024-12-18T17:07:20Z","title":"Towards Generalist Robot Policies: What Matters in Building\n Vision-Language-Action Models","summary":" Foundation Vision Language Models (VLMs) exhibit strong capabilities in\nmulti-modal representation learning, comprehension, and reasoning. 
By injecting\naction components into the VLMs, Vision-Language-Action Models (VLAs) can be\nnaturally formed and also show promising performance. Existing work has\ndemonstrated the effectiveness and generalization of VLAs in multiple scenarios\nand tasks. Nevertheless, the transfer from VLMs to VLAs is not trivial since\nexisting VLAs differ in their backbones, action-prediction formulations, data\ndistributions, and training recipes. This leads to a missing piece for a\nsystematic understanding of the design choices of VLAs. In this work, we\ndisclose the key factors that significantly influence the performance of VLA\nand focus on answering three essential design choices: which backbone to\nselect, how to formulate the VLA architectures, and when to add\ncross-embodiment data. The obtained results convince us firmly to explain why\nwe need VLA and develop a new family of VLAs, RoboVLMs, which require very few\nmanual designs and achieve a new state-of-the-art performance in three\nsimulation tasks and real-world experiments. Through our extensive experiments,\nwhich include over 8 VLM backbones, 4 policy architectures, and over 600\ndistinct designed experiments, we provide a detailed guidebook for the future\ndesign of VLAs. In addition to the study, the highly flexible RoboVLMs\nframework, which supports easy integrations of new VLMs and free combinations\nof various design choices, is made public to facilitate future research. We\nopen-source all details, including codes, models, datasets, and toolkits, along\nwith detailed training and evaluation recipes at: robovlms.github.io.\n","authors":["Xinghang Li","Peiyan Li","Minghuan Liu","Dong Wang","Jirong Liu","Bingyi Kang","Xiao Ma","Tao Kong","Hanbo Zhang","Huaping Liu"],"pdf_url":"https://arxiv.org/pdf/2412.14058v3.pdf","comment":"Project page: robovlms.github.io. 
Added limitations and future works.\n Fix categorization"},{"id":"http://arxiv.org/abs/2412.15484v2","updated":"2024-12-24T04:42:49Z","published":"2024-12-20T01:37:22Z","title":"Toward Robust Hyper-Detailed Image Captioning: A Multiagent Approach and\n Dual Evaluation Metrics for Factuality and Coverage","summary":" Multimodal large language models (MLLMs) excel at generating highly detailed\ncaptions but often produce hallucinations. Our analysis reveals that existing\nhallucination detection methods struggle with detailed captions. We attribute\nthis to the increasing reliance of MLLMs on their generated text, rather than\nthe input image, as the sequence length grows. To address this issue, we\npropose a multiagent approach that leverages LLM-MLLM collaboration to correct\ngiven captions. Additionally, we introduce an evaluation framework and a\nbenchmark dataset to facilitate the systematic analysis of detailed captions.\nOur experiments demonstrate that our proposed evaluation method better aligns\nwith human judgments of factuality than existing metrics and that existing\napproaches to improve the MLLM factuality may fall short in hyper-detailed\nimage captioning tasks. 
In contrast, our proposed method significantly enhances\nthe factual accuracy of captions, even improving those generated by GPT-4V.\nFinally, we highlight a limitation of VQA-centric benchmarking by demonstrating\nthat an MLLM's performance on VQA benchmarks may not correlate with its ability\nto generate detailed image captions.\n","authors":["Saehyung Lee","Seunghyun Yoon","Trung Bui","Jing Shi","Sungroh Yoon"],"pdf_url":"https://arxiv.org/pdf/2412.15484v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.18160v1","updated":"2024-12-24T04:36:35Z","published":"2024-12-24T04:36:35Z","title":"Image Quality Assessment: Exploring Regional Heterogeneity via Response\n of Adaptive Multiple Quality Factors in Dictionary Space","summary":" Given that the factors influencing image quality vary significantly with\nscene, content, and distortion type, particularly in the context of regional\nheterogeneity, we propose an adaptive multi-quality factor (AMqF) framework to\nrepresent image quality in a dictionary space, enabling the precise capture of\nquality features in non-uniformly distorted regions. By designing an adapter,\nthe framework can flexibly decompose quality factors (such as brightness,\nstructure, contrast, etc.) that best align with human visual perception and\nquantify them into discrete visual words. These visual words respond to the\nconstructed dictionary basis vector, and by obtaining the corresponding\ncoordinate vectors, we can measure visual similarity. Our method offers two key\ncontributions. First, an adaptive mechanism that extracts and decomposes\nquality factors according to human visual perception principles enhances their\nrepresentation ability through reconstruction constraints. 
Second, the\nconstruction of a comprehensive and discriminative dictionary space and basis\nvector allows quality factors to respond effectively to the dictionary basis\nvector and capture non-uniform distortion patterns in images, significantly\nimproving the accuracy of visual similarity measurement. The experimental\nresults demonstrate that the proposed method outperforms existing\nstate-of-the-art approaches in handling various types of distorted images. The\nsource code is available at https://anonymous.4open.science/r/AMqF-44B2.\n","authors":["Xuting Lan","Mingliang Zhou","Jielu Yan","Xuekai Wei","Yueting Huang","Zhaowei Shang","Huayan Pu"],"pdf_url":"https://arxiv.org/pdf/2412.18160v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.18158v1","updated":"2024-12-24T04:32:36Z","published":"2024-12-24T04:32:36Z","title":"Semantics Disentanglement and Composition for Versatile Codec toward\n both Human-eye Perception and Machine Vision Task","summary":" While learned image compression methods have achieved impressive results in\neither human visual perception or machine vision tasks, they are often\nspecialized only for one domain. This drawback limits their versatility and\ngeneralizability across scenarios and also requires retraining to adapt to new\napplications-a process that adds significant complexity and cost in real-world\nscenarios. In this study, we introduce an innovative semantics DISentanglement\nand COmposition VERsatile codec (DISCOVER) to simultaneously enhance human-eye\nperception and machine vision tasks. The approach derives a set of labels per\ntask through multimodal large models, which grounding models are then applied\nfor precise localization, enabling a comprehensive understanding and\ndisentanglement of image components at the encoder side. 
At the decoding stage,\na comprehensive reconstruction of the image is achieved by leveraging these\nencoded components alongside priors from generative models, thereby optimizing\nperformance for both human visual perception and machine-based analytical\ntasks. Extensive experimental evaluations substantiate the robustness and\neffectiveness of DISCOVER, demonstrating superior performance in fulfilling the\ndual objectives of human and machine vision requirements.\n","authors":["Jinming Liu","Yuntao Wei","Junyan Lin","Shengyang Zhao","Heming Sun","Zhibo Chen","Wenjun Zeng","Xin Jin"],"pdf_url":"https://arxiv.org/pdf/2412.18158v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2305.14014v4","updated":"2024-12-24T04:27:37Z","published":"2023-05-23T12:51:20Z","title":"CLIP4STR: A Simple Baseline for Scene Text Recognition with Pre-trained\n Vision-Language Model","summary":" Pre-trained vision-language models~(VLMs) are the de-facto foundation models\nfor various downstream tasks. However, scene text recognition methods still\nprefer backbones pre-trained on a single modality, namely, the visual modality,\ndespite the potential of VLMs to serve as powerful scene text readers. For\nexample, CLIP can robustly identify regular (horizontal) and irregular\n(rotated, curved, blurred, or occluded) text in images. With such merits, we\ntransform CLIP into a scene text reader and introduce CLIP4STR, a simple yet\neffective STR method built upon image and text encoders of CLIP. It has two\nencoder-decoder branches: a visual branch and a cross-modal branch. The visual\nbranch provides an initial prediction based on the visual feature, and the\ncross-modal branch refines this prediction by addressing the discrepancy\nbetween the visual feature and text semantics. To fully leverage the\ncapabilities of both branches, we design a dual predict-and-refine decoding\nscheme for inference. 
We scale CLIP4STR in terms of the model size,\npre-training data, and training data, achieving state-of-the-art performance on\n13 STR benchmarks. Additionally, a comprehensive empirical study is provided to\nenhance the understanding of the adaptation of CLIP to STR. Our method\nestablishes a simple yet strong baseline for future STR research with VLMs.\n","authors":["Shuai Zhao","Ruijie Quan","Linchao Zhu","Yi Yang"],"pdf_url":"https://arxiv.org/pdf/2305.14014v4.pdf","comment":"Accepted by T-IP. A PyTorch re-implementation is at\n https://github.com/VamosC/CLIP4STR (Credit on GitHub@VamosC)"},{"id":"http://arxiv.org/abs/2411.02799v3","updated":"2024-12-24T04:24:27Z","published":"2024-11-05T04:20:06Z","title":"ERUP-YOLO: Enhancing Object Detection Robustness for Adverse Weather\n Condition by Unified Image-Adaptive Processing","summary":" We propose an image-adaptive object detection method for adverse weather\nconditions such as fog and low-light. Our framework employs differentiable\npreprocessing filters to perform image enhancement suitable for later-stage\nobject detections. Our framework introduces two differentiable filters: a\nB\\'ezier curve-based pixel-wise (BPW) filter and a kernel-based local (KBL)\nfilter. These filters unify the functions of classical image processing filters\nand improve performance of object detection. We also propose a domain-agnostic\ndata augmentation strategy using the BPW filter. Our method does not require\ndata-specific customization of the filter combinations, parameter ranges, and\ndata augmentation. We evaluate our proposed approach, called Enhanced\nRobustness by Unified Image Processing (ERUP)-YOLO, by applying it to the\nYOLOv3 detector. 
Experiments on adverse weather datasets demonstrate that our\nproposed filters match or exceed the expressiveness of conventional methods and\nour ERUP-YOLO achieved superior performance in a wide range of adverse weather\nconditions, including fog and low-light conditions.\n","authors":["Yuka Ogino","Yuho Shoji","Takahiro Toizumi","Atsushi Ito"],"pdf_url":"https://arxiv.org/pdf/2411.02799v3.pdf","comment":"Accepted to WACV 2025"},{"id":"http://arxiv.org/abs/2410.15446v2","updated":"2024-12-24T04:23:50Z","published":"2024-10-20T16:52:09Z","title":"Concept Complement Bottleneck Model for Interpretable Medical Image\n Diagnosis","summary":" Models based on human-understandable concepts have received extensive\nattention to improve model interpretability for trustworthy artificial\nintelligence in the field of medical image analysis. These methods can provide\nconvincing explanations for model decisions but heavily rely on the detailed\nannotation of pre-defined concepts. Consequently, they may not be effective in\ncases where concepts or annotations are incomplete or low-quality. Although\nsome methods automatically discover effective and new visual concepts rather\nthan using pre-defined concepts or could find some human-understandable\nconcepts via large Language models, they are prone to veering away from medical\ndiagnostic evidence and are challenging to understand. In this paper, we\npropose a concept complement bottleneck model for interpretable medical image\ndiagnosis with the aim of complementing the existing concept set and finding\nnew concepts bridging the gap between explainable models. Specifically, we\npropose to use concept adapters for specific concepts to mine the concept\ndifferences and score concepts in their own attention channels to support\nalmost fairly concept learning. Then, we devise a concept complement strategy\nto learn new concepts while jointly using known concepts to improve model\nperformance. 
Comprehensive experiments on medical datasets demonstrate that our\nmodel outperforms the state-of-the-art competitors in concept detection and\ndisease diagnosis tasks while providing diverse explanations to ensure model\ninterpretability effectively.\n","authors":["Hongmei Wang","Junlin Hou","Hao Chen"],"pdf_url":"https://arxiv.org/pdf/2410.15446v2.pdf","comment":"27 pages, 5 figures,"},{"id":"http://arxiv.org/abs/2412.17153v2","updated":"2024-12-24T04:21:15Z","published":"2024-12-22T20:21:54Z","title":"Distilled Decoding 1: One-step Sampling of Image Auto-regressive Models\n with Flow Matching","summary":" Autoregressive (AR) models have achieved state-of-the-art performance in text\nand image generation but suffer from slow generation due to the token-by-token\nprocess. We ask an ambitious question: can a pre-trained AR model be adapted to\ngenerate outputs in just one or two steps? If successful, this would\nsignificantly advance the development and deployment of AR models. We notice\nthat existing works that try to speed up AR generation by generating multiple\ntokens at once fundamentally cannot capture the output distribution due to the\nconditional dependencies between tokens, limiting their effectiveness for\nfew-step generation. To address this, we propose Distilled Decoding (DD), which\nuses flow matching to create a deterministic mapping from Gaussian distribution\nto the output distribution of the pre-trained AR model. We then train a network\nto distill this mapping, enabling few-step generation. DD doesn't need the\ntraining data of the original AR model, making it more practical. We evaluate\nDD on state-of-the-art image AR models and present promising results on\nImageNet-256. For VAR, which requires 10-step generation, DD enables one-step\ngeneration (6.3$\\times$ speed-up), with an acceptable increase in FID from 4.19\nto 9.96. 
For LlamaGen, DD reduces generation from 256 steps to 1, achieving an\n217.8$\\times$ speed-up with a comparable FID increase from 4.11 to 11.35. In\nboth cases, baseline methods completely fail with FID>100. DD also excels on\ntext-to-image generation, reducing the generation from 256 steps to 2 for\nLlamaGen with minimal FID increase from 25.70 to 28.95. As the first work to\ndemonstrate the possibility of one-step generation for image AR models, DD\nchallenges the prevailing notion that AR models are inherently slow, and opens\nup new opportunities for efficient AR generation. The project website is at\nhttps://imagination-research.github.io/distilled-decoding.\n","authors":["Enshu Liu","Xuefei Ning","Yu Wang","Zinan Lin"],"pdf_url":"https://arxiv.org/pdf/2412.17153v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.18153v1","updated":"2024-12-24T04:16:38Z","published":"2024-12-24T04:16:38Z","title":"DepthLab: From Partial to Complete","summary":" Missing values remain a common challenge for depth data across its wide range\nof applications, stemming from various causes like incomplete data acquisition\nand perspective alteration. This work bridges this gap with DepthLab, a\nfoundation depth inpainting model powered by image diffusion priors. Our model\nfeatures two notable strengths: (1) it demonstrates resilience to\ndepth-deficient regions, providing reliable completion for both continuous\nareas and isolated points, and (2) it faithfully preserves scale consistency\nwith the conditioned known depth when filling in missing values. Drawing on\nthese advantages, our approach proves its worth in various downstream tasks,\nincluding 3D scene inpainting, text-to-3D scene generation, sparse-view\nreconstruction with DUST3R, and LiDAR depth completion, exceeding current\nsolutions in both numerical performance and visual quality. 
Our project page\nwith source code is available at https://johanan528.github.io/depthlab_web/.\n","authors":["Zhiheng Liu","Ka Leong Cheng","Qiuyu Wang","Shuzhe Wang","Hao Ouyang","Bin Tan","Kai Zhu","Yujun Shen","Qifeng Chen","Ping Luo"],"pdf_url":"https://arxiv.org/pdf/2412.18153v1.pdf","comment":"Project page and code: https://johanan528.github.io/depthlab_web/"},{"id":"http://arxiv.org/abs/2412.11464v2","updated":"2024-12-24T04:13:08Z","published":"2024-12-16T05:44:45Z","title":"MaskCLIP++: A Mask-Based CLIP Fine-tuning Framework for Open-Vocabulary\n Image Segmentation","summary":" Open-vocabulary image segmentation has been advanced through the synergy\nbetween mask generators and vision-language models like Contrastive\nLanguage-Image Pre-training (CLIP). Previous approaches focus on generating\nmasks while aligning mask features with text embeddings during training. In\nthis paper, we observe that relying on generated low-quality masks can weaken\nthe alignment of vision and language in regional representations. This\nmotivates us to present a new fine-tuning framework, named MaskCLIP++, which\nuses ground-truth masks instead of generated masks to enhance the mask\nclassification capability of CLIP. Due to the limited diversity of image\nsegmentation datasets with mask annotations, we propose incorporating a\nconsistency alignment constraint during fine-tuning, which alleviates\ncategorical bias toward the fine-tuning dataset. After low-cost fine-tuning,\ncombining with the mask generator in previous state-of-the-art mask-based open\nvocabulary segmentation methods, we achieve performance improvements of +1.7,\n+2.3, +2.1, +3.1, and +0.3 mIoU on the A-847, PC-459, A-150, PC-59, and PAS-20\ndatasets, respectively. 
Code is released at\nhttps://github.com/HVision-NKU/MaskCLIPpp .\n","authors":["Quan-Sheng Zeng","Yunheng Li","Daquan Zhou","Guanbin Li","Qibin Hou","Ming-Ming Cheng"],"pdf_url":"https://arxiv.org/pdf/2412.11464v2.pdf","comment":"20 pages, 8 figures. Add code link"},{"id":"http://arxiv.org/abs/2410.19944v3","updated":"2024-12-24T04:12:18Z","published":"2024-10-25T19:42:57Z","title":"A Multimodal Approach For Endoscopic VCE Image Classification Using\n BiomedCLIP-PubMedBERT","summary":" This Paper presents an advanced approach for fine-tuning BiomedCLIP\nPubMedBERT, a multimodal model, to classify abnormalities in Video Capsule\nEndoscopy (VCE) frames, aiming to enhance diagnostic efficiency in\ngastrointestinal healthcare. By integrating the PubMedBERT language model with\na Vision Transformer (ViT) to process endoscopic images, our method categorizes\nimages into ten specific classes: angioectasia, bleeding, erosion, erythema,\nforeign body, lymphangiectasia, polyp, ulcer, worms, and normal. Our workflow\nincorporates image preprocessing and fine-tunes the BiomedCLIP model to\ngenerate high-quality embeddings for both visual and textual inputs, aligning\nthem through similarity scoring for classification. 
Performance metrics,\nincluding classification, accuracy, recall, and F1 score, indicate the models\nstrong ability to accurately identify abnormalities in endoscopic frames,\nshowing promise for practical use in clinical diagnostics.\n","authors":["Nagarajan Ganapathy","Podakanti Satyajith Chary","Teja Venkata Ramana Kumar Pithani","Pavan Kavati","Arun Kumar S"],"pdf_url":"https://arxiv.org/pdf/2410.19944v3.pdf","comment":"11 Pages, 2 Figures, Capsule Vision 2024 Challenge"},{"id":"http://arxiv.org/abs/2412.18150v1","updated":"2024-12-24T04:08:25Z","published":"2024-12-24T04:08:25Z","title":"EvalMuse-40K: A Reliable and Fine-Grained Benchmark with Comprehensive\n Human Annotations for Text-to-Image Generation Model Evaluation","summary":" Recently, Text-to-Image (T2I) generation models have achieved significant\nadvancements. Correspondingly, many automated metrics have emerged to evaluate\nthe image-text alignment capabilities of generative models. However, the\nperformance comparison among these automated metrics is limited by existing\nsmall datasets. Additionally, these datasets lack the capacity to assess the\nperformance of automated metrics at a fine-grained level. In this study, we\ncontribute an EvalMuse-40K benchmark, gathering 40K image-text pairs with\nfine-grained human annotations for image-text alignment-related tasks. In the\nconstruction process, we employ various strategies such as balanced prompt\nsampling and data re-annotation to ensure the diversity and reliability of our\nbenchmark. This allows us to comprehensively evaluate the effectiveness of\nimage-text alignment metrics for T2I models. Meanwhile, we introduce two new\nmethods to evaluate the image-text alignment capabilities of T2I models:\nFGA-BLIP2 which involves end-to-end fine-tuning of a vision-language model to\nproduce fine-grained image-text alignment scores and PN-VQA which adopts a\nnovel positive-negative VQA manner in VQA models for zero-shot fine-grained\nevaluation. 
Both methods achieve impressive performance in image-text alignment\nevaluations. We also use our methods to rank current AIGC models, in which the\nresults can serve as a reference source for future study and promote the\ndevelopment of T2I generation. The data and code will be made publicly\navailable.\n","authors":["Shuhao Han","Haotian Fan","Jiachen Fu","Liang Li","Tao Li","Junhui Cui","Yunqiu Wang","Yang Tai","Jingwei Sun","Chunle Guo","Chongyi Li"],"pdf_url":"https://arxiv.org/pdf/2412.18150v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.14074v3","updated":"2024-12-24T04:05:27Z","published":"2024-01-25T10:52:36Z","title":"ProCNS: Progressive Prototype Calibration and Noise Suppression for\n Weakly-Supervised Medical Image Segmentation","summary":" Weakly-supervised segmentation (WSS) has emerged as a solution to mitigate\nthe conflict between annotation cost and model performance by adopting sparse\nannotation formats (e.g., point, scribble, block, etc.). Typical approaches\nattempt to exploit anatomy and topology priors to directly expand sparse\nannotations into pseudo-labels. However, due to a lack of attention to the\nambiguous edges in medical images and insufficient exploration of sparse\nsupervision, existing approaches tend to generate erroneous and overconfident\npseudo proposals in noisy regions, leading to cumulative model error and\nperformance degradation. In this work, we propose a novel WSS approach, named\nProCNS, encompassing two synergistic modules devised with the principles of\nprogressive prototype calibration and noise suppression. Specifically, we\ndesign a Prototype-based Regional Spatial Affinity (PRSA) loss to maximize the\npair-wise affinities between spatial and semantic elements, providing our model\nof interest with more reliable guidance. The affinities are derived from the\ninput images and the prototype-refined predictions. 
Meanwhile, we propose an\nAdaptive Noise Perception and Masking (ANPM) module to obtain more enriched and\nrepresentative prototype representations, which adaptively identifies and masks\nnoisy regions within the pseudo proposals, reducing potential erroneous\ninterference during prototype computation. Furthermore, we generate specialized\nsoft pseudo-labels for the noisy regions identified by ANPM, providing\nsupplementary supervision. Extensive experiments on six medical image\nsegmentation tasks involving different modalities demonstrate that the proposed\nframework significantly outperforms representative state-of-the-art methods.\n","authors":["Y. Liu","L. Lin","K. K. Y. Wong","X. Tang"],"pdf_url":"https://arxiv.org/pdf/2401.14074v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.18149v1","updated":"2024-12-24T04:05:21Z","published":"2024-12-24T04:05:21Z","title":"Dense-Face: Personalized Face Generation Model via Dense Annotation\n Prediction","summary":" The text-to-image (T2I) personalization diffusion model can generate images\nof the novel concept based on the user input text caption. However, existing\nT2I personalized methods either require test-time fine-tuning or fail to\ngenerate images that align well with the given text caption. In this work, we\npropose a new T2I personalization diffusion model, Dense-Face, which can\ngenerate face images with a consistent identity as the given reference subject\nand align well with the text caption. Specifically, we introduce a\npose-controllable adapter for the high-fidelity image generation while\nmaintaining the text-based editing ability of the pre-trained stable diffusion\n(SD). Additionally, we use internal features of the SD UNet to predict dense\nface annotations, enabling the proposed method to gain domain knowledge in face\ngeneration. 
Empirically, our method achieves state-of-the-art or competitive\ngeneration performance in image-text alignment, identity preservation, and pose\ncontrol.\n","authors":["Xiao Guo","Manh Tran","Jiaxin Cheng","Xiaoming Liu"],"pdf_url":"https://arxiv.org/pdf/2412.18149v1.pdf","comment":"15 figures, 5 tables"},{"id":"http://arxiv.org/abs/2412.18147v1","updated":"2024-12-24T04:04:33Z","published":"2024-12-24T04:04:33Z","title":"Accelerating Post-Tornado Disaster Assessment Using Advanced Deep\n Learning Models","summary":" Post-disaster assessments of buildings and infrastructure are crucial for\nboth immediate recovery efforts and long-term resilience planning. This\nresearch introduces an innovative approach to automating post-disaster\nassessments through advanced deep learning models. Our proposed system employs\nstate-of-the-art computer vision techniques (YOLOv11 and ResNet50) to rapidly\nanalyze images and videos from disaster sites, extracting critical information\nabout building characteristics, including damage level of structural components\nand the extent of damage. Our experimental results show promising performance,\nwith ResNet50 achieving 90.28% accuracy and an inference time of 1529ms per\nimage on multiclass damage classification. 
This study contributes to the field\nof disaster management by offering a scalable, efficient, and objective tool\nfor post-disaster analysis, potentially capable of transforming how communities\nand authorities respond to and learn from catastrophic events.\n","authors":["Robinson Umeike","Thang Dao","Shane Crawford"],"pdf_url":"https://arxiv.org/pdf/2412.18147v1.pdf","comment":"3 pages, 4 Figures, 1 Table"},{"id":"http://arxiv.org/abs/2412.18136v1","updated":"2024-12-24T03:44:26Z","published":"2024-12-24T03:44:26Z","title":"ERVD: An Efficient and Robust ViT-Based Distillation Framework for\n Remote Sensing Image Retrieval","summary":" ERVD: An Efficient and Robust ViT-Based Distillation Framework for Remote\nSensing Image Retrieval\n","authors":["Le Dong","Qixuan Cao","Lei Pu","Fangfang Wu","Weisheng Dong","Xin Li","Guangming Shi"],"pdf_url":"https://arxiv.org/pdf/2412.18136v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.18131v1","updated":"2024-12-24T03:40:05Z","published":"2024-12-24T03:40:05Z","title":"UniPLV: Towards Label-Efficient Open-World 3D Scene Understanding by\n Regional Visual Language Supervision","summary":" We present UniPLV, a powerful framework that unifies point clouds, images and\ntext in a single learning paradigm for open-world 3D scene understanding.\nUniPLV employs the image modal as a bridge to co-embed 3D points with\npre-aligned images and text in a shared feature space without requiring\ncarefully crafted point cloud text pairs. To accomplish multi-modal alignment,\nwe propose two key strategies:(i) logit and feature distillation modules\nbetween images and point clouds, and (ii) a vison-point matching module is\ngiven to explicitly correct the misalignment caused by points to pixels\nprojection. To further improve the performance of our unified framework, we\nadopt four task-specific losses and a two-stage training strategy. 
Extensive\nexperiments show that our method outperforms the state-of-the-art methods by an\naverage of 15.6% and 14.8% for semantic segmentation over Base-Annotated and\nAnnotation-Free tasks, respectively. The code will be released later.\n","authors":["Yuru Wang","Songtao Wang","Zehan Zhang","Xinyan Lu","Changwei Cai","Hao Li","Fu Liu","Peng Jia","Xianpeng Lang"],"pdf_url":"https://arxiv.org/pdf/2412.18131v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2403.08557v5","updated":"2024-12-24T03:38:02Z","published":"2024-03-13T14:08:45Z","title":"OC4-ReID: Occluded Cloth-Changing Person Re-Identification","summary":" The study of Cloth-Changing Person Re-identification (CC-ReID) focuses on\nretrieving specific pedestrians when their clothing has changed, typically\nunder the assumption that the entire pedestrian images are visible. Pedestrian\nimages in real-world scenarios, however, are often partially obscured by\nobstacles, presenting a significant challenge to existing CC-ReID systems. In\nthis paper, we introduce a more challenging task termed Occluded Cloth-Changing\nPerson Re-Identification (OC4-ReID), which simultaneously addresses two\nchallenges of clothing changes and occlusion. Concretely, we construct two new\ndatasets, Occ-LTCC and Occ-PRCC, based on original CC-ReID datasets to include\nrandom occlusions of key pedestrians components (e.g., head, torso). Moreover,\na novel benchmark is proposed for OC4-ReID incorporating a Train-Test Micro\nGranularity Screening (T2MGS) module to mitigate the influence of occlusion and\nproposing a Part-Robust Triplet (PRT) loss for partial features learning.\nComprehensive experiments on the proposed datasets, as well as on two CC-ReID\nbenchmark datasets demonstrate the superior performance of proposed method\nagainst other state-of-the-art methods. 
The codes and datasets are available\nat: https://github.com/1024AILab/OC4-ReID.\n","authors":["Zhihao Chen","Yiyuan Ge","Ziyang Wang","Jiaju Kang","Mingya Zhang"],"pdf_url":"https://arxiv.org/pdf/2403.08557v5.pdf","comment":null},{"id":"http://arxiv.org/abs/2410.14332v3","updated":"2024-12-24T03:27:45Z","published":"2024-10-18T09:44:25Z","title":"Croc: Pretraining Large Multimodal Models with Cross-Modal Comprehension","summary":" Recent advances in Large Language Models (LLMs) have catalyzed the\ndevelopment of Large Multimodal Models (LMMs). However, existing research\nprimarily focuses on tuning language and image instructions, ignoring the\ncritical pretraining phase where models learn to process textual and visual\nmodalities jointly. In this paper, we propose a new pretraining paradigm for\nLMMs to enhance the visual comprehension capabilities of LLMs by introducing a\nnovel cross-modal comprehension stage. Specifically, we design a dynamically\nlearnable prompt token pool and employ the Hungarian algorithm to replace part\nof the original visual tokens with the most relevant prompt tokens. Then, we\nconceptualize visual tokens as analogous to a \"foreign language\" for the LLMs\nand propose a mixed attention mechanism with bidirectional visual attention and\nunidirectional textual attention to comprehensively enhance the understanding\nof visual tokens. Meanwhile, we integrate a detailed caption generation task,\nleveraging rich descriptions to further facilitate LLMs in understanding visual\nsemantic information. After pretraining on 1.5 million publicly accessible\ndata, we present a new foundation model called Croc. Experimental results\ndemonstrate that Croc achieves new state-of-the-art performance on massive\nvision-language benchmarks. 
To support reproducibility and facilitate further\nresearch, we release the training code and pre-trained model weights at\nhttps://github.com/deepglint/Croc.\n","authors":["Yin Xie","Kaicheng Yang","Ninghua Yang","Weimo Deng","Xiangzi Dai","Tiancheng Gu","Yumeng Wang","Xiang An","Yongle Zhao","Ziyong Feng","Roy Miles","Ismail Elezi","Jiankang Deng"],"pdf_url":"https://arxiv.org/pdf/2410.14332v3.pdf","comment":"14 pages, 12 figures"},{"id":"http://arxiv.org/abs/2412.17504v2","updated":"2024-12-24T03:21:40Z","published":"2024-12-23T12:03:35Z","title":"An Evaluation Framework for Product Images Background Inpainting based\n on Human Feedback and Product Consistency","summary":" In product advertising applications, the automated inpainting of backgrounds\nutilizing AI techniques in product images has emerged as a significant task.\nHowever, the techniques still suffer from issues such as inappropriate\nbackground and inconsistent product in generated product images, and existing\napproaches for evaluating the quality of generated product images are mostly\ninconsistent with human feedback causing the evaluation for this task to depend\non manual annotation. To relieve the issues above, this paper proposes Human\nFeedback and Product Consistency (HFPC), which can automatically assess the\ngenerated product images based on two modules. Firstly, to solve inappropriate\nbackgrounds, human feedback on 44,000 automated inpainting product images is\ncollected to train a reward model based on multi-modal features extracted from\nBLIP and comparative learning. Secondly, to filter generated product images\ncontaining inconsistent products, a fine-tuned segmentation model is employed\nto segment the product of the original and generated product images and then\ncompare the differences between the above two. 
Extensive experiments have\ndemonstrated that HFPC can effectively evaluate the quality of generated\nproduct images and significantly reduce the expense of manual annotation.\nMoreover, HFPC achieves state-of-the-art(96.4% in precision) in comparison to\nother open-source visual-quality-assessment models. Dataset and code are\navailable at:\nhttps://github.com/created-Bi/background_inpainting_products_dataset\n","authors":["Yuqi Liang","Jun Luo","Xiaoxi Guo","Jianqi Bi"],"pdf_url":"https://arxiv.org/pdf/2412.17504v2.pdf","comment":"accepted by AAAI2025"},{"id":"http://arxiv.org/abs/2412.18124v1","updated":"2024-12-24T03:19:29Z","published":"2024-12-24T03:19:29Z","title":"VisionLLM-based Multimodal Fusion Network for Glottic Carcinoma Early\n Detection","summary":" The early detection of glottic carcinoma is critical for improving patient\noutcomes, as it enables timely intervention, preserves vocal function, and\nsignificantly reduces the risk of tumor progression and metastasis. However,\nthe similarity in morphology between glottic carcinoma and vocal cord dysplasia\nresults in suboptimal detection accuracy. To address this issue, we propose a\nvision large language model-based (VisionLLM-based) multimodal fusion network\nfor glottic carcinoma detection, known as MMGC-Net. By integrating image and\ntext modalities, multimodal models can capture complementary information,\nleading to more accurate and robust predictions. In this paper, we collect a\nprivate real glottic carcinoma dataset named SYSU1H from the First Affiliated\nHospital of Sun Yat-sen University, with 5,799 image-text pairs. We leverage an\nimage encoder and additional Q-Former to extract vision embeddings and the\nLarge Language Model Meta AI (Llama3) to obtain text embeddings. These\nmodalities are then integrated through a laryngeal feature fusion block,\nenabling a comprehensive integration of image and text features, thereby\nimproving the glottic carcinoma identification performance. 
Extensive\nexperiments on the SYSU1H dataset demonstrate that MMGC-Net can achieve\nstate-of-the-art performance, which is superior to previous multimodal models.\n","authors":["Zhaohui Jin","Yi Shuai","Yongcheng Li","Lingcong Cai","Yun Li","Huifen Liu","Xiaomao Fan"],"pdf_url":"https://arxiv.org/pdf/2412.18124v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.17331v2","updated":"2024-12-24T03:15:44Z","published":"2024-12-23T06:49:59Z","title":"Uncertainty-Participation Context Consistency Learning for\n Semi-supervised Semantic Segmentation","summary":" Semi-supervised semantic segmentation has attracted considerable attention\nfor its ability to mitigate the reliance on extensive labeled data. However,\nexisting consistency regularization methods only utilize high certain pixels\nwith prediction confidence surpassing a fixed threshold for training, failing\nto fully leverage the potential supervisory information within the network.\nTherefore, this paper proposes the Uncertainty-participation Context\nConsistency Learning (UCCL) method to explore richer supervisory signals.\nSpecifically, we first design the semantic backpropagation update (SBU)\nstrategy to fully exploit the knowledge from uncertain pixel regions, enabling\nthe model to learn consistent pixel-level semantic information from those\nareas. Furthermore, we propose the class-aware knowledge regulation (CKR)\nmodule to facilitate the regulation of class-level semantic features across\ndifferent augmented views, promoting consistent learning of class-level\nsemantic information within the encoder. Experimental results on two public\nbenchmarks demonstrate that our proposed method achieves state-of-the-art\nperformance. 
Our code is available at https://github.com/YUKEKEJAN/UCCL.\n","authors":["Jianjian Yin","Yi Chen","Zhichao Zheng","Junsheng Zhou","Yanhui Gu"],"pdf_url":"https://arxiv.org/pdf/2412.17331v2.pdf","comment":"To be published in ICASSP"},{"id":"http://arxiv.org/abs/2407.08127v2","updated":"2024-12-24T03:02:07Z","published":"2024-07-11T01:58:35Z","title":"Prediction Exposes Your Face: Black-box Model Inversion via Prediction\n Alignment","summary":" Model inversion (MI) attack reconstructs the private training data of a\ntarget model given its output, posing a significant threat to deep learning\nmodels and data privacy. On one hand, most of existing MI methods focus on\nsearching for latent codes to represent the target identity, yet this iterative\noptimization-based scheme consumes a huge number of queries to the target\nmodel, making it unrealistic especially in black-box scenario. On the other\nhand, some training-based methods launch an attack through a single forward\ninference, whereas failing to directly learn high-level mappings from\nprediction vectors to images. Addressing these limitations, we propose a novel\nPrediction-to-Image (P2I) method for black-box MI attack. Specifically, we\nintroduce the Prediction Alignment Encoder to map the target model's output\nprediction into the latent code of StyleGAN. In this way, prediction vector\nspace can be well aligned with the more disentangled latent space, thus\nestablishing a connection between prediction vectors and the semantic facial\nfeatures. During the attack phase, we further design the Aligned Ensemble\nAttack scheme to integrate complementary facial attributes of target identity\nfor better reconstruction. 
Experimental results show that our method\noutperforms other SOTAs, e.g., compared with RLB-MI, our method improves attack\naccuracy by 8.5% and reduces query numbers by 99% on dataset CelebA.\n","authors":["Yufan Liu","Wanqian Zhang","Dayan Wu","Zheng Lin","Jingzi Gu","Weiping Wang"],"pdf_url":"https://arxiv.org/pdf/2407.08127v2.pdf","comment":"Accepted by ECCV 2024"},{"id":"http://arxiv.org/abs/2412.18112v1","updated":"2024-12-24T02:52:43Z","published":"2024-12-24T02:52:43Z","title":"Spectrum-oriented Point-supervised Saliency Detector for Hyperspectral\n Images","summary":" Hyperspectral salient object detection (HSOD) aims to extract targets or\nregions with significantly different spectra from hyperspectral images. While\nexisting deep learning-based methods can achieve good detection results, they\ngenerally necessitate pixel-level annotations, which are notably challenging to\nacquire for hyperspectral images. To address this issue, we introduce point\nsupervision into HSOD, and incorporate Spectral Saliency, derived from\nconventional HSOD methods, as a pivotal spectral representation within the\nframework. This integration leads to the development of a novel\nSpectrum-oriented Point-supervised Saliency Detector (SPSD). Specifically, we\npropose a novel pipeline, specifically designed for HSIs, to generate\npseudo-labels, effectively mitigating the performance decline associated with\npoint supervision strategy. Additionally, Spectral Saliency is employed to\ncounteract information loss during model supervision and saliency refinement,\nthereby maintaining the structural integrity and edge accuracy of the detected\nobjects. Furthermore, we introduce a Spectrum-transformed Spatial Gate to focus\nmore precisely on salient regions while reducing feature redundancy. 
We have\ncarried out comprehensive experiments on both HSOD-BIT and HS-SOD datasets to\nvalidate the efficacy of our proposed method, using mean absolute error (MAE),\nE-measure, F-measure, Area Under Curve, and Cross Correlation as evaluation\nmetrics. For instance, on the HSOD-BIT dataset, our SPSD achieves a MAE of\n0.031 and an F-measure of 0.878. Thorough ablation studies have substantiated\nthe effectiveness of each individual module and provided insights into the\nmodel's working mechanism. Further evaluations on RGB-thermal salient object\ndetection datasets highlight the versatility of our approach.\n","authors":["Peifu Liu","Tingfa Xu","Guokai Shi","Jingxuan Xu","Huan Chen","Jianan Li"],"pdf_url":"https://arxiv.org/pdf/2412.18112v1.pdf","comment":"Accepted by IEEE TIM. Code: https://github.com/laprf/SPSD"},{"id":"http://arxiv.org/abs/2411.10958v3","updated":"2024-12-24T02:50:14Z","published":"2024-11-17T04:35:49Z","title":"SageAttention2: Efficient Attention with Thorough Outlier Smoothing and\n Per-thread INT4 Quantization","summary":" Although quantization for linear layers has been widely used, its application\nto accelerate the attention process remains limited. To further enhance the\nefficiency of attention computation compared to SageAttention while maintaining\nprecision, we propose SageAttention2, which utilizes significantly faster 4-bit\nmatrix multiplication (Matmul) alongside additional precision-enhancing\ntechniques. First, we propose to quantize matrixes $(Q, K)$ to INT4 in a\nhardware-friendly thread-level granularity and quantize matrixes $(\\widetilde\nP, V)$ to FP8. Second, we propose a method to smooth $Q$, enhancing the\naccuracy of INT4 $QK$. Third, we propose to use an FP32 Matmul buffer for $PV$\nto enhance the accuracy of FP8 $\\widetilde PV$. The operations per second (OPS)\nof SageAttention2 surpass FlashAttention2 and xformers by about 3x and 5x on\nRTX4090, respectively. 
Comprehensive experiments confirm that our approach\nincurs negligible end-to-end metrics loss across diverse models, including\nthose for large language processing, image generation, and video generation.\nThe codes are available at https://github.com/thu-ml/SageAttention.\n","authors":["Jintao Zhang","Haofeng Huang","Pengle Zhang","Jia Wei","Jun Zhu","Jianfei Chen"],"pdf_url":"https://arxiv.org/pdf/2411.10958v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.17635v2","updated":"2024-12-24T02:48:55Z","published":"2024-12-23T15:12:20Z","title":"LangSurf: Language-Embedded Surface Gaussians for 3D Scene Understanding","summary":" Applying Gaussian Splatting to perception tasks for 3D scene understanding is\nbecoming increasingly popular. Most existing works primarily focus on rendering\n2D feature maps from novel viewpoints, which leads to an imprecise 3D language\nfield with outlier languages, ultimately failing to align objects in 3D space.\nBy utilizing masked images for feature extraction, these approaches also lack\nessential contextual information, leading to inaccurate feature representation.\nTo this end, we propose a Language-Embedded Surface Field (LangSurf), which\naccurately aligns the 3D language fields with the surface of objects,\nfacilitating precise 2D and 3D segmentation with text query, widely expanding\nthe downstream tasks such as removal and editing. The core of LangSurf is a\njoint training strategy that flattens the language Gaussian on the object\nsurfaces using geometry supervision and contrastive losses to assign accurate\nlanguage features to the Gaussians of objects. In addition, we also introduce\nthe Hierarchical-Context Awareness Module to extract features at the image\nlevel for contextual information then perform hierarchical mask pooling using\nmasks segmented by SAM to obtain fine-grained language features in different\nhierarchies. 
Extensive experiments on open-vocabulary 2D and 3D semantic\nsegmentation demonstrate that LangSurf outperforms the previous\nstate-of-the-art method LangSplat by a large margin. As shown in Fig. 1, our\nmethod is capable of segmenting objects in 3D space, thus boosting the\neffectiveness of our approach in instance recognition, removal, and editing,\nwhich is also supported by comprehensive experiments.\nProject page: https://langsurf.github.io.\n","authors":["Hao Li","Roy Qin","Zhengyu Zou","Diqi He","Bohan Li","Bingquan Dai","Dingewn Zhang","Junwei Han"],"pdf_url":"https://arxiv.org/pdf/2412.17635v2.pdf","comment":"https://langsurf.github.io"},{"id":"http://arxiv.org/abs/2412.13496v2","updated":"2024-12-24T02:41:04Z","published":"2024-12-18T04:34:46Z","title":"QueryCDR: Query-Based Controllable Distortion Rectification Network for\n Fisheye Images","summary":" Fisheye image rectification aims to correct distortions in images taken with\nfisheye cameras. Although current models show promising results on images with\na similar degree of distortion as the training data, they will produce\nsub-optimal results when the degree of distortion changes and without\nretraining. The lack of generalization ability for dealing with varying degrees\nof distortion limits their practical application. In this paper, we take one\nstep further to enable effective distortion rectification for images with\nvarying degrees of distortion without retraining. We propose a novel\nQuery-Based Controllable Distortion Rectification network for fisheye images\n(QueryCDR). In particular, we first present the Distortion-aware Learnable\nQuery Mechanism (DLQM), which defines the latent spatial relationships for\ndifferent distortion degrees as a series of learnable queries. Each query can\nbe learned to obtain position-dependent rectification control conditions,\nproviding control over the rectification process. 
Then, we propose two kinds of\ncontrollable modulating blocks to enable the control conditions to guide the\nmodulation of the distortion features better. These core components cooperate\nwith each other to effectively boost the generalization ability of the model at\nvarying degrees of distortion. Extensive experiments on fisheye image datasets\nwith different distortion degrees demonstrate our approach achieves\nhigh-quality and controllable distortion rectification.\n","authors":["Pengbo Guo","Chengxu Liu","Xingsong Hou","Xueming Qian"],"pdf_url":"https://arxiv.org/pdf/2412.13496v2.pdf","comment":"ECCV2024"},{"id":"http://arxiv.org/abs/2412.18108v1","updated":"2024-12-24T02:31:24Z","published":"2024-12-24T02:31:24Z","title":"Unveiling Visual Perception in Language Models: An Attention Head\n Analysis Approach","summary":" Recent advancements in Multimodal Large Language Models (MLLMs) have\ndemonstrated remarkable progress in visual understanding. This impressive leap\nraises a compelling question: how can language models, initially trained solely\non linguistic data, effectively interpret and process visual content? This\npaper aims to address this question with systematic investigation across 4\nmodel families and 4 model scales, uncovering a unique class of attention heads\nthat focus specifically on visual content. Our analysis reveals a strong\ncorrelation between the behavior of these attention heads, the distribution of\nattention weights, and their concentration on visual tokens within the input.\nThese findings enhance our understanding of how LLMs adapt to multimodal tasks,\ndemonstrating their potential to bridge the gap between textual and visual\nunderstanding. 
This work paves the way for the development of AI systems\ncapable of engaging with diverse modalities.\n","authors":["Jing Bi","Junjia Guo","Yunlong Tang","Lianggong Bruce Wen","Zhang Liu","Chenliang Xu"],"pdf_url":"https://arxiv.org/pdf/2412.18108v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.10748v2","updated":"2024-12-24T02:27:53Z","published":"2024-12-14T08:31:56Z","title":"A Pioneering Neural Network Method for Efficient and Robust Fuel\n Sloshing Simulation in Aircraft","summary":" Simulating fuel sloshing within aircraft tanks during flight is crucial for\naircraft safety research. Traditional methods based on Navier-Stokes equations\nare computationally expensive. In this paper, we treat fluid motion as point\ncloud transformation and propose the first neural network method specifically\ndesigned for simulating fuel sloshing in aircraft. This model is also the deep\nlearning model that is the first to be capable of stably modeling fluid\nparticle dynamics in such complex scenarios. Our triangle feature fusion design\nachieves an optimal balance among fluid dynamics modeling, momentum\nconservation constraints, and global stability control. Additionally, we\nconstructed the Fueltank dataset, the first dataset for aircraft fuel surface\nsloshing. It comprises 320,000 frames across four typical tank types and covers\na wide range of flight maneuvers, including multi-directional rotations. We\nconducted comprehensive experiments on both our dataset and the take-off\nscenario of the aircraft. Compared to existing neural network-based fluid\nsimulation algorithms, we significantly enhanced accuracy while maintaining\nhigh computational speed. Compared to traditional SPH methods, our speed\nimproved approximately 10 times. 
Furthermore, compared to traditional fluid\nsimulation software such as Flow3D, our computation speed increased by more\nthan 300 times.\n","authors":["Yu Chen","Shuai Zheng","Nianyi Wang","Menglong Jin","Yan Chang"],"pdf_url":"https://arxiv.org/pdf/2412.10748v2.pdf","comment":"This paper has been accepted by AAAI Conference on Artificial\n Intelligence (AAAI-25)"},{"id":"http://arxiv.org/abs/2412.18105v1","updated":"2024-12-24T02:27:35Z","published":"2024-12-24T02:27:35Z","title":"Beyond the Known: Enhancing Open Set Domain Adaptation with Unknown\n Exploration","summary":" Convolutional neural networks (CNNs) can learn directly from raw data,\nresulting in exceptional performance across various research areas. However,\nfactors present in non-controllable environments such as unlabeled datasets\nwith varying levels of domain and category shift can reduce model accuracy. The\nOpen Set Domain Adaptation (OSDA) is a challenging problem that arises when\nboth of these issues occur together. Existing OSDA approaches in literature\nonly align known classes or use supervised training to learn unknown classes as\na single new category. In this work, we introduce a new approach to improve\nOSDA techniques by extracting a set of high-confidence unknown instances and\nusing it as a hard constraint to tighten the classification boundaries.\nSpecifically, we use a new loss constraint that is evaluated in three different\nways: (1) using pristine negative instances directly; (2) using data\naugmentation techniques to create randomly transformed negatives; and (3) with\ngenerated synthetic negatives containing adversarial features. We analyze\ndifferent strategies to improve the discriminator and the training of the\nGenerative Adversarial Network (GAN) used to generate synthetic negatives. We\nconducted extensive experiments and analysis on OVANet using three widely-used\npublic benchmarks, the Office-31, Office-Home, and VisDA datasets. 
We were able\nto achieve similar H-score to other state-of-the-art methods, while increasing\nthe accuracy on unknown categories.\n","authors":["Lucas Fernando Alvarenga e Silva","Samuel Felipe dos Santos","Nicu Sebe","Jurandy Almeida"],"pdf_url":"https://arxiv.org/pdf/2412.18105v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2402.02431v2","updated":"2024-12-24T02:20:02Z","published":"2024-02-04T10:00:00Z","title":"Learning Mutual Excitation for Hand-to-Hand and Human-to-Human\n Interaction Recognition","summary":" Recognizing interactive actions, including hand-to-hand interaction and\nhuman-to-human interaction, has attracted increasing attention for various\napplications in the field of video analysis and human-robot interaction.\nConsidering the success of graph convolution in modeling topology-aware\nfeatures from skeleton data, recent methods commonly operate graph convolution\non separate entities and use late fusion for interactive action recognition,\nwhich can barely model the mutual semantic relationships between pairwise\nentities. To this end, we propose a mutual excitation graph convolutional\nnetwork (me-GCN) by stacking mutual excitation graph convolution (me-GC)\nlayers. Specifically, me-GC uses a mutual topology excitation module to firstly\nextract adjacency matrices from individual entities and then adaptively model\nthe mutual constraints between them. Moreover, me-GC extends the above idea and\nfurther uses a mutual feature excitation module to extract and merge deep\nfeatures from pairwise entities. Compared with graph convolution, our proposed\nme-GC gradually learns mutual information in each layer and each stage of graph\nconvolution operations. 
Extensive experiments on a challenging hand-to-hand\ninteraction dataset, i.e., the Assembly101 dataset, and two large-scale\nhuman-to-human interaction datasets, i.e., NTU60-Interaction and\nNTU120-Interaction consistently verify the superiority of our proposed method,\nwhich outperforms the state-of-the-art GCN-based and Transformer-based methods.\n","authors":["Mengyuan Liu","Chen Chen","Songtao Wu","Fanyang Meng","Hong Liu"],"pdf_url":"https://arxiv.org/pdf/2402.02431v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2402.16991v3","updated":"2024-12-24T02:17:39Z","published":"2024-02-26T19:52:33Z","title":"A Phase Transition in Diffusion Models Reveals the Hierarchical Nature\n of Data","summary":" Understanding the structure of real data is paramount in advancing modern\ndeep-learning methodologies. Natural data such as images are believed to be\ncomposed of features organized in a hierarchical and combinatorial manner,\nwhich neural networks capture during learning. Recent advancements show that\ndiffusion models can generate high-quality images, hinting at their ability to\ncapture this underlying compositional structure. We study this phenomenon in a\nhierarchical generative model of data. We find that the backward diffusion\nprocess acting after a time $t$ is governed by a phase transition at some\nthreshold time, where the probability of reconstructing high-level features,\nlike the class of an image, suddenly drops. Instead, the reconstruction of\nlow-level features, such as specific details of an image, evolves smoothly\nacross the whole diffusion process. This result implies that at times beyond\nthe transition, the class has changed, but the generated sample may still be\ncomposed of low-level elements of the initial image. We validate these\ntheoretical insights through numerical experiments on class-unconditional\nImageNet diffusion models. 
Our analysis characterizes the relationship between\ntime and scale in diffusion models and puts forward generative models as\npowerful tools to model combinatorial data properties.\n","authors":["Antonio Sclocchi","Alessandro Favero","Matthieu Wyart"],"pdf_url":"https://arxiv.org/pdf/2402.16991v3.pdf","comment":"9 pages, 7 figures. Appendix: 11 pages, 9 figures"},{"id":"http://arxiv.org/abs/2412.18090v1","updated":"2024-12-24T02:04:47Z","published":"2024-12-24T02:04:47Z","title":"Multi-Point Positional Insertion Tuning for Small Object Detection","summary":" Small object detection aims to localize and classify small objects within\nimages. With recent advances in large-scale vision-language pretraining,\nfinetuning pretrained object detection models has emerged as a promising\napproach. However, finetuning large models is computationally and memory\nexpensive. To address this issue, this paper introduces multi-point positional\ninsertion (MPI) tuning, a parameter-efficient finetuning (PEFT) method for\nsmall object detection. Specifically, MPI incorporates multiple positional\nembeddings into a frozen pretrained model, enabling the efficient detection of\nsmall objects by providing precise positional information to latent features.\nThrough experiments, we demonstrated the effectiveness of the proposed method\non the SODA-D dataset. 
MPI performed comparably to conventional PEFT methods,\nincluding CoOp and VPT, while significantly reducing the number of parameters\nthat need to be tuned.\n","authors":["Kanoko Goto","Takumi Karasawa","Takumi Hirose","Rei Kawakami","Nakamasa Inoue"],"pdf_url":"https://arxiv.org/pdf/2412.18090v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.18089v1","updated":"2024-12-24T02:04:07Z","published":"2024-12-24T02:04:07Z","title":"Convolutional Prompting for Broad-Domain Retinal Vessel Segmentation","summary":" Previous research on retinal vessel segmentation is targeted at a specific\nimage domain, mostly color fundus photography (CFP). In this paper we make a\nbrave attempt to attack a more challenging task of broad-domain retinal vessel\nsegmentation (BD-RVS), which is to develop a unified model applicable to varied\ndomains including CFP, SLO, UWF, OCTA and FFA. To that end, we propose Dual\nConvolutional Prompting (DCP) that learns to extract domain-specific features\nby localized prompting along both position and channel dimensions. DCP is\ndesigned as a plug-in module that can effectively turn an R2AU-Net based vessel\nsegmentation network to a unified model, yet without the need of modifying its\nnetwork structure. For evaluation we build a broad-domain set using five public\ndomain-specific datasets including ROSSA, FIVES, IOSTAR, PRIME-FP20 and\nVAMPIRE. In order to benchmark BD-RVS on the broad-domain dataset, we\nre-purpose a number of existing methods originally developed in other contexts,\nproducing eight baseline methods in total. 
Extensive experiments show the\nproposed method compares favorably against the baselines for BD-RVS.\n","authors":["Qijie Wei","Weihong Yu","Xirong Li"],"pdf_url":"https://arxiv.org/pdf/2412.18089v1.pdf","comment":"Accepted by ICASSP 2025"},{"id":"http://arxiv.org/abs/2412.16839v2","updated":"2024-12-24T01:53:54Z","published":"2024-12-22T03:15:39Z","title":"Human-Guided Image Generation for Expanding Small-Scale Training Image\n Datasets","summary":" The performance of computer vision models in certain real-world applications\n(e.g., rare wildlife observation) is limited by the small number of available\nimages. Expanding datasets using pre-trained generative models is an effective\nway to address this limitation. However, since the automatic generation process\nis uncontrollable, the generated images are usually limited in diversity, and\nsome of them are undesired. In this paper, we propose a human-guided image\ngeneration method for more controllable dataset expansion. We develop a\nmulti-modal projection method with theoretical guarantees to facilitate the\nexploration of both the original and generated images. Based on the\nexploration, users refine the prompts and re-generate images for better\nperformance. Since directly refining the prompts is challenging for novice\nusers, we develop a sample-level prompt refinement method to make it easier.\nWith this method, users only need to provide sample-level feedback (e.g., which\nsamples are undesired) to obtain better prompts. 
The effectiveness of our\nmethod is demonstrated through the quantitative evaluation of the multi-modal\nprojection method, improved model performance in the case study for both\nclassification and object detection tasks, and positive feedback from the\nexperts.\n","authors":["Changjian Chen","Fei Lv","Yalong Guan","Pengcheng Wang","Shengjie Yu","Yifan Zhang","Zhuo Tang"],"pdf_url":"https://arxiv.org/pdf/2412.16839v2.pdf","comment":"Accepted by TVCG2025"},{"id":"http://arxiv.org/abs/2404.15734v4","updated":"2024-12-24T01:53:17Z","published":"2024-04-24T08:46:25Z","title":"ODMixer: Fine-grained Spatial-temporal MLP for Metro Origin-Destination\n Prediction","summary":" Metro Origin-Destination (OD) prediction is a crucial yet challenging\nspatial-temporal prediction task in urban computing, which aims to accurately\nforecast cross-station ridership for optimizing metro scheduling and enhancing\noverall transport efficiency. Analyzing fine-grained and comprehensive\nrelations among stations effectively is imperative for metro OD prediction.\nHowever, existing metro OD models either mix information from multiple OD pairs\nfrom the station's perspective or exclusively focus on a subset of OD pairs.\nThese approaches may overlook fine-grained relations among OD pairs, leading to\ndifficulties in predicting potential anomalous conditions. To address these\nchallenges, we learn traffic evolution from the perspective of all OD pairs and\npropose a fine-grained spatial-temporal MLP architecture for metro OD\nprediction, namely ODMixer. Specifically, our ODMixer has double-branch\nstructure and involves the Channel Mixer, the Multi-view Mixer, and the\nBidirectional Trend Learner. The Channel Mixer aims to capture short-term\ntemporal relations among OD pairs, the Multi-view Mixer concentrates on\ncapturing spatial relations from both origin and destination perspectives. To\nmodel long-term temporal relations, we introduce the Bidirectional Trend\nLearner. 
Extensive experiments on two large-scale metro OD prediction datasets\nHZMOD and SHMO demonstrate the advantages of our ODMixer. Our code is available\nat https://github.com/KLatitude/ODMixer.\n","authors":["Yang Liu","Binglin Chen","Yongsen Zheng","Lechao Cheng","Guanbin Li","Liang Lin"],"pdf_url":"https://arxiv.org/pdf/2404.15734v4.pdf","comment":"Code is available at https://github.com/KLatitude/ODMixer"},{"id":"http://arxiv.org/abs/2412.13541v2","updated":"2024-12-24T01:50:55Z","published":"2024-12-18T06:40:53Z","title":"Spatio-Temporal Fuzzy-oriented Multi-Modal Meta-Learning for\n Fine-grained Emotion Recognition","summary":" Fine-grained emotion recognition (FER) plays a vital role in various fields,\nsuch as disease diagnosis, personalized recommendations, and multimedia mining.\nHowever, existing FER methods face three key challenges in real-world\napplications: (i) they rely on large amounts of continuously annotated data to\nensure accuracy since emotions are complex and ambiguous in reality, which is\ncostly and time-consuming; (ii) they cannot capture the temporal heterogeneity\ncaused by changing emotion patterns, because they usually assume that the\ntemporal correlation within sampling periods is the same; (iii) they do not\nconsider the spatial heterogeneity of different FER scenarios, that is, the\ndistribution of emotion information in different data may have bias or\ninterference. To address these challenges, we propose a Spatio-Temporal\nFuzzy-oriented Multi-modal Meta-learning framework (ST-F2M). Specifically,\nST-F2M first divides the multi-modal videos into multiple views, and each view\ncorresponds to one modality of one emotion. Multiple randomly selected views\nfor the same emotion form a meta-training task. Next, ST-F2M uses an integrated\nmodule with spatial and temporal convolutions to encode the data of each task,\nreflecting the spatial and temporal heterogeneity. 
Then it adds fuzzy semantic\ninformation to each task based on generalized fuzzy rules, which helps handle\nthe complexity and ambiguity of emotions. Finally, ST-F2M learns\nemotion-related general meta-knowledge through meta-recurrent neural networks\nto achieve fast and robust fine-grained emotion recognition. Extensive\nexperiments show that ST-F2M outperforms various state-of-the-art methods in\nterms of accuracy and model efficiency. In addition, we construct ablation\nstudies and further analysis to explore why ST-F2M performs well.\n","authors":["Jingyao Wang","Yuxuan Yang","Wenwen Qiang","Changwen Zheng","Hui Xiong"],"pdf_url":"https://arxiv.org/pdf/2412.13541v2.pdf","comment":"13 pages, Submitted to TMM in 30-May-2024"},{"id":"http://arxiv.org/abs/2412.18076v1","updated":"2024-12-24T01:14:48Z","published":"2024-12-24T01:14:48Z","title":"COMO: Cross-Mamba Interaction and Offset-Guided Fusion for Multimodal\n Object Detection","summary":" Single-modal object detection tasks often experience performance degradation\nwhen encountering diverse scenarios. In contrast, multimodal object detection\ntasks can offer more comprehensive information about object features by\nintegrating data from various modalities. Current multimodal object detection\nmethods generally use various fusion techniques, including conventional neural\nnetworks and transformer-based models, to implement feature fusion strategies\nand achieve complementary information. However, since multimodal images are\ncaptured by different sensors, there are often misalignments between them,\nmaking direct matching challenging. This misalignment hinders the ability to\nestablish strong correlations for the same object across different modalities.\nIn this paper, we propose a novel approach called the CrOss-Mamba interaction\nand Offset-guided fusion (COMO) framework for multimodal object detection\ntasks. 
The COMO framework employs the cross-mamba technique to formulate\nfeature interaction equations, enabling multimodal serialized state\ncomputation. This results in interactive fusion outputs while reducing\ncomputational overhead and improving efficiency. Additionally, COMO leverages\nhigh-level features, which are less affected by misalignment, to facilitate\ninteraction and transfer complementary information between modalities,\naddressing the positional offset challenges caused by variations in camera\nangles and capture times. Furthermore, COMO incorporates a global and local\nscanning mechanism in the cross-mamba module to capture features with local\ncorrelation, particularly in remote sensing images. To preserve low-level\nfeatures, the offset-guided fusion mechanism ensures effective multiscale\nfeature utilization, allowing the construction of a multiscale fusion data cube\nthat enhances detection performance.\n","authors":["Chang Liu","Xin Ma","Xiaochen Yang","Yuxiang Zhang","Yanni Dong"],"pdf_url":"https://arxiv.org/pdf/2412.18076v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.07132v2","updated":"2024-12-24T01:01:19Z","published":"2024-12-10T02:41:21Z","title":"Revisiting Lesion Tracking in 3D Total Body Photography","summary":" Melanoma is the most deadly form of skin cancer. Tracking the evolution of\nnevi and detecting new lesions across the body is essential for the early\ndetection of melanoma. Despite prior work on longitudinal tracking of skin\nlesions in 3D total body photography, there are still several challenges,\nincluding 1) low accuracy for finding correct lesion pairs across scans, 2)\nsensitivity to noisy lesion detection, and 3) lack of large-scale datasets with\nnumerous annotated lesion pairs. We propose a framework that takes in a pair of\n3D textured meshes, matches lesions in the context of total body photography,\nand identifies unmatchable lesions. 
We start by computing correspondence maps\nbringing the source and target meshes to a template mesh. Using these maps to\ndefine source/target signals over the template domain, we construct a flow\nfield aligning the mapped signals. The initial correspondence maps are then\nrefined by advecting forward/backward along the vector field. Finally, lesion\nassignment is performed using the refined correspondence maps. We propose the\nfirst large-scale dataset for skin lesion tracking with 25K lesion pairs across\n198 subjects. The proposed method achieves a success rate of 89.9% (at 10 mm\ncriterion) for all pairs of annotated lesions and a matching accuracy of 98.2%\nfor subjects with more than 200 lesions.\n","authors":["Wei-Lun Huang","Minghao Xue","Zhiyou Liu","Davood Tashayyod","Jun Kang","Amir Gandjbakhche","Misha Kazhdan","Mehran Armand"],"pdf_url":"https://arxiv.org/pdf/2412.07132v2.pdf","comment":"v2"},{"id":"http://arxiv.org/abs/2412.18072v1","updated":"2024-12-24T00:59:16Z","published":"2024-12-24T00:59:16Z","title":"MMFactory: A Universal Solution Search Engine for Vision-Language Tasks","summary":" With advances in foundational and vision-language models, and effective\nfine-tuning techniques, a large number of both general and special-purpose\nmodels have been developed for a variety of visual tasks. Despite the\nflexibility and accessibility of these models, no single model is able to\nhandle all tasks and/or applications that may be envisioned by potential users.\nRecent approaches, such as visual programming and multimodal LLMs with\nintegrated tools aim to tackle complex visual tasks, by way of program\nsynthesis. However, such approaches overlook user constraints (e.g.,\nperformance / computational needs), produce test-time sample-specific solutions\nthat are difficult to deploy, and, sometimes, require low-level instructions\nthat may be beyond the abilities of a naive user. 
To address these limitations,\nwe introduce MMFactory, a universal framework that includes model and metrics\nrouting components, acting like a solution search engine across various\navailable models. Based on a task description and few sample input-output pairs\nand (optionally) resource and/or performance constraints, MMFactory can suggest\na diverse pool of programmatic solutions by instantiating and combining\nvisio-lingual tools from its model repository. In addition to synthesizing\nthese solutions, MMFactory also proposes metrics and benchmarks performance /\nresource characteristics, allowing users to pick a solution that meets their\nunique design constraints. From the technical perspective, we also introduced a\ncommittee-based solution proposer that leverages multi-agent LLM conversation\nto generate executable, diverse, universal, and robust solutions for the user.\nExperimental results show that MMFactory outperforms existing methods by\ndelivering state-of-the-art solutions tailored to user problem specifications.\nProject page is available at https://davidhalladay.github.io/mmfactory_demo.\n","authors":["Wan-Cyuan Fan","Tanzila Rahman","Leonid Sigal"],"pdf_url":"https://arxiv.org/pdf/2412.18072v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.10360v3","updated":"2024-12-24T00:55:15Z","published":"2024-08-19T18:56:24Z","title":"HaSPeR: An Image Repository for Hand Shadow Puppet Recognition","summary":" Hand shadow puppetry, also known as shadowgraphy or ombromanie, is a form of\ntheatrical art and storytelling where hand shadows are projected onto flat\nsurfaces to create illusions of living creatures. The skilled performers create\nthese silhouettes by hand positioning, finger movements, and dexterous gestures\nto resemble shadows of animals and objects. Due to the lack of practitioners\nand a seismic shift in people's entertainment standards, this art form is on\nthe verge of extinction. 
To facilitate its preservation and proliferate it to a\nwider audience, we introduce ${\\rm H{\\small A}SP{\\small E}R}$, a novel dataset\nconsisting of 15,000 images of hand shadow puppets across 15 classes extracted\nfrom both professional and amateur hand shadow puppeteer clips. We provide a\ndetailed statistical analysis of the dataset and employ a range of pretrained\nimage classification models to establish baselines. Our findings show a\nsubstantial performance superiority of skip-connected convolutional models over\nattention-based transformer architectures. We also find that lightweight\nmodels, such as MobileNetV2, suited for mobile applications and embedded\ndevices, perform comparatively well. We surmise that such low-latency\narchitectures can be useful in developing ombromanie teaching tools, and we\ncreate a prototype application to explore this surmission. Keeping the\nbest-performing model ResNet34 under the limelight, we conduct comprehensive\nfeature-spatial, explainability, and error analyses to gain insights into its\ndecision-making process. To the best of our knowledge, this is the first\ndocumented dataset and research endeavor to preserve this dying art for future\ngenerations, with computer vision approaches. Our code and data will be\npublicly available.\n","authors":["Syed Rifat Raiyan","Zibran Zarif Amio","Sabbir Ahmed"],"pdf_url":"https://arxiv.org/pdf/2408.10360v3.pdf","comment":"Submitted to IEEE Transactions on Artificial Intelligence (IEEE TAI),\n 13 pages, 105 figures, 2 tables"},{"id":"http://arxiv.org/abs/2412.18065v1","updated":"2024-12-24T00:28:28Z","published":"2024-12-24T00:28:28Z","title":"BIG-MoE: Bypass Isolated Gating MoE for Generalized Multimodal Face\n Anti-Spoofing","summary":" In the domain of facial recognition security, multimodal Face Anti-Spoofing\n(FAS) is essential for countering presentation attacks. 
However, existing\ntechnologies encounter challenges due to modality biases and imbalances, as\nwell as domain shifts. Our research introduces a Mixture of Experts (MoE) model\nto address these issues effectively. We identified three limitations in\ntraditional MoE approaches to multimodal FAS: (1) Coarse-grained experts'\ninability to capture nuanced spoofing indicators; (2) Gated networks'\nsusceptibility to input noise affecting decision-making; (3) MoE's sensitivity\nto prompt tokens leading to overfitting with conventional learning methods. To\nmitigate these, we propose the Bypass Isolated Gating MoE (BIG-MoE) framework,\nfeaturing: (1) Fine-grained experts for enhanced detection of subtle spoofing\ncues; (2) An isolation gating mechanism to counteract input noise; (3) A novel\ndifferential convolutional prompt bypass enriching the gating network with\ncritical local features, thereby improving perceptual capabilities. Extensive\nexperiments on four benchmark datasets demonstrate significant generalization\nperformance improvement in multimodal FAS task. The code is released at\nhttps://github.com/murInJ/BIG-MoE.\n","authors":["Yingjie Ma","Zitong Yu","Xun Lin","Weicheng Xie","Linlin Shen"],"pdf_url":"https://arxiv.org/pdf/2412.18065v1.pdf","comment":"Accepted by ICASSP 2025"},{"id":"http://arxiv.org/abs/2412.18060v1","updated":"2024-12-24T00:13:10Z","published":"2024-12-24T00:13:10Z","title":"An Ensemble Approach to Short-form Video Quality Assessment Using\n Multimodal LLM","summary":" The rise of short-form videos, characterized by diverse content, editing\nstyles, and artifacts, poses substantial challenges for learning-based blind\nvideo quality assessment (BVQA) models. Multimodal large language models\n(MLLMs), renowned for their superior generalization capabilities, present a\npromising solution. 
This paper focuses on effectively leveraging a pretrained\nMLLM for short-form video quality assessment, regarding the impacts of\npre-processing and response variability, and insights on combining the MLLM\nwith BVQA models. We first investigated how frame pre-processing and sampling\ntechniques influence the MLLM's performance. Then, we introduced a lightweight\nlearning-based ensemble method that adaptively integrates predictions from the\nMLLM and state-of-the-art BVQA models. Our results demonstrated superior\ngeneralization performance with the proposed ensemble approach. Furthermore,\nthe analysis of content-aware ensemble weights highlighted that some video\ncharacteristics are not fully represented by existing BVQA models, revealing\npotential directions to improve BVQA models further.\n","authors":["Wen Wen","Yilin Wang","Neil Birkbeck","Balu Adsumilli"],"pdf_url":"https://arxiv.org/pdf/2412.18060v1.pdf","comment":"Accepted by ICASSP 2025"}],"Information Retrieval":[{"id":"http://arxiv.org/abs/2409.02425v2","updated":"2024-12-24T15:15:33Z","published":"2024-09-04T04:12:22Z","title":"Deep Adaptive Interest Network: Personalized Recommendation with\n Context-Aware Learning","summary":" In personalized recommendation systems, accurately capturing users' evolving\ninterests and combining them with contextual information is a critical research\narea. This paper proposes a novel model called the Deep Adaptive Interest\nNetwork (DAIN), which dynamically models users' interests while incorporating\ncontext-aware learning mechanisms to achieve precise and adaptive personalized\nrecommendations. DAIN leverages deep learning techniques to build an adaptive\ninterest network structure that can capture users' interest changes in\nreal-time while further optimizing recommendation results by integrating\ncontextual information. Experiments conducted on several public datasets\ndemonstrate that DAIN excels in both recommendation performance and\ncomputational efficiency. 
This research not only provides a new solution for\npersonalized recommendation systems but also offers fresh insights into the\napplication of context-aware learning in recommendation systems.\n","authors":["Shuaishuai Huang","Haowei Yang","You Yao","Xueting Lin","Yuming Tu"],"pdf_url":"https://arxiv.org/pdf/2409.02425v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.18431v1","updated":"2024-12-24T13:45:22Z","published":"2024-12-24T13:45:22Z","title":"GeAR: Graph-enhanced Agent for Retrieval-augmented Generation","summary":" Retrieval-augmented generation systems rely on effective document retrieval\ncapabilities. By design, conventional sparse or dense retrievers face\nchallenges in multi-hop retrieval scenarios. In this paper, we present GeAR,\nwhich advances RAG performance through two key innovations: (i) graph\nexpansion, which enhances any conventional base retriever, such as BM25, and\n(ii) an agent framework that incorporates graph expansion. Our evaluation\ndemonstrates GeAR's superior retrieval performance on three multi-hop question\nanswering datasets. Additionally, our system achieves state-of-the-art results\nwith improvements exceeding 10% on the challenging MuSiQue dataset, while\nrequiring fewer tokens and iterations compared to other multi-step retrieval\nsystems.\n","authors":["Zhili Shen","Chenxin Diao","Pavlos Vougiouklis","Pascual Merita","Shriram Piramanayagam","Damien Graux","Dandan Tu","Zeren Jiang","Ruofei Lai","Yang Ren","Jeff Z. Pan"],"pdf_url":"https://arxiv.org/pdf/2412.18431v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2410.04739v2","updated":"2024-12-24T13:18:49Z","published":"2024-10-07T04:15:02Z","title":"TableRAG: Million-Token Table Understanding with Language Models","summary":" Recent advancements in language models (LMs) have notably enhanced their\nability to reason with tabular data, primarily through program-aided mechanisms\nthat manipulate and analyze tables. 
However, these methods often require the\nentire table as input, leading to scalability challenges due to the positional\nbias or context length constraints. In response to these challenges, we\nintroduce TableRAG, a Retrieval-Augmented Generation (RAG) framework\nspecifically designed for LM-based table understanding. TableRAG leverages\nquery expansion combined with schema and cell retrieval to pinpoint crucial\ninformation before providing it to the LMs. This enables more efficient data\nencoding and precise retrieval, significantly reducing prompt lengths and\nmitigating information loss. We have developed two new million-token benchmarks\nfrom the Arcade and BIRD-SQL datasets to thoroughly evaluate TableRAG's\neffectiveness at scale. Our results demonstrate that TableRAG's retrieval\ndesign achieves the highest retrieval quality, leading to the new\nstate-of-the-art performance on large-scale table understanding.\n","authors":["Si-An Chen","Lesly Miculicich","Julian Martin Eisenschlos","Zifeng Wang","Zilong Wang","Yanfei Chen","Yasuhisa Fujii","Hsuan-Tien Lin","Chen-Yu Lee","Tomas Pfister"],"pdf_url":"https://arxiv.org/pdf/2410.04739v2.pdf","comment":"Accepted to NeurIPS 2024"},{"id":"http://arxiv.org/abs/2412.18396v1","updated":"2024-12-24T12:39:23Z","published":"2024-12-24T12:39:23Z","title":"Contrastive Representation for Interactive Recommendation","summary":" Interactive Recommendation (IR) has gained significant attention recently for\nits capability to quickly capture dynamic interest and optimize both short and\nlong term objectives. IR agents are typically implemented through Deep\nReinforcement Learning (DRL), because DRL is inherently compatible with the\ndynamic nature of IR. However, DRL is currently not perfect for IR. Due to the\nlarge action space and sample inefficiency problem, training DRL recommender\nagents is challenging. 
The key point is that useful features cannot be\nextracted as high-quality representations for the recommender agent to optimize\nits policy. To tackle this problem, we propose Contrastive Representation for\nInteractive Recommendation (CRIR). CRIR efficiently extracts latent, high-level\npreference ranking features from explicit interaction, and leverages the\nfeatures to enhance users' representation. Specifically, the CRIR provides\nrepresentation through one representation network, and refines it through our\nproposed Preference Ranking Contrastive Learning (PRCL). The key insight of\nPRCL is that it can perform contrastive learning without relying on\ncomputations involving high-level representations or large potential action\nsets. Furthermore, we also propose a data exploiting mechanism and an agent\ntraining mechanism to better adapt CRIR to the DRL backbone. Extensive\nexperiments have been carried out to show our method's superior improvement on\nthe sample efficiency while training a DRL-based IR agent.\n","authors":["Jingyu Li","Zhiyong Feng","Dongxiao He","Hongqi Chen","Qinghang Gao","Guoli Wu"],"pdf_url":"https://arxiv.org/pdf/2412.18396v1.pdf","comment":"AAAI-2025 Accepted paper"},{"id":"http://arxiv.org/abs/2412.18378v1","updated":"2024-12-24T12:07:48Z","published":"2024-12-24T12:07:48Z","title":"RaSeRec: Retrieval-Augmented Sequential Recommendation","summary":" Although prevailing supervised and self-supervised learning (SSL)-augmented\nsequential recommendation (SeRec) models have achieved improved performance\nwith powerful neural network architectures, we argue that they still suffer\nfrom two limitations: (1) Preference Drift, where models trained on past data\ncan hardly accommodate evolving user preference; and (2) Implicit Memory, where\nhead patterns dominate parametric learning, making it harder to recall long\ntails. In this work, we explore retrieval augmentation in SeRec, to address\nthese limitations. 
To this end, we propose a Retrieval-Augmented Sequential\nRecommendation framework, named RaSeRec, the main idea of which is to maintain\na dynamic memory bank to accommodate preference drifts and retrieve relevant\nmemories to augment user modeling explicitly. It consists of two stages: (i)\ncollaborative-based pre-training, which learns to recommend and retrieve; (ii)\nretrieval-augmented fine-tuning, which learns to leverage retrieved memories.\nExtensive experiments on three datasets fully demonstrate the superiority and\neffectiveness of RaSeRec.\n","authors":["Xinping Zhao","Baotian Hu","Yan Zhong","Shouzheng Huang","Zihao Zheng","Meng Wang","Haofen Wang","Min zhang"],"pdf_url":"https://arxiv.org/pdf/2412.18378v1.pdf","comment":"20 pages, 8 figures, 8 tables"},{"id":"http://arxiv.org/abs/2412.18376v1","updated":"2024-12-24T12:02:43Z","published":"2024-12-24T12:02:43Z","title":"Bidirectional Topic Matching: Quantifying Thematic Overlap Between\n Corpora Through Topic Modelling","summary":" This study introduces Bidirectional Topic Matching (BTM), a novel method for\ncross-corpus topic modeling that quantifies thematic overlap and divergence\nbetween corpora. BTM is a flexible framework that can incorporate various topic\nmodeling approaches, including BERTopic, Top2Vec, and Latent Dirichlet\nAllocation (LDA). BTM employs a dual-model approach, training separate topic\nmodels for each corpus and applying them reciprocally to enable comprehensive\ncross-corpus comparisons. This methodology facilitates the identification of\nshared themes and unique topics, providing nuanced insights into thematic\nrelationships. Validation against cosine similarity-based methods demonstrates\nthe robustness of BTM, with strong agreement metrics and distinct advantages in\nhandling outlier topics. A case study on climate news articles showcases BTM's\nutility, revealing significant thematic overlaps and distinctions between\ncorpora focused on climate change and climate action. 
BTM's flexibility and\nprecision make it a valuable tool for diverse applications, from political\ndiscourse analysis to interdisciplinary studies. By integrating shared and\nunique topic analyses, BTM offers a comprehensive framework for exploring\nthematic relationships, with potential extensions to multilingual and dynamic\ndatasets. This work highlights BTM's methodological contributions and its\ncapacity to advance discourse analysis across various domains.\n","authors":["Raven Adam","Marie Lisa Kogler"],"pdf_url":"https://arxiv.org/pdf/2412.18376v1.pdf","comment":"12 pages, 4 figures"},{"id":"http://arxiv.org/abs/2412.17690v2","updated":"2024-12-24T11:03:42Z","published":"2024-12-23T16:16:30Z","title":"RAGONITE: Iterative Retrieval on Induced Databases and Verbalized RDF\n for Conversational QA over KGs with RAG","summary":" Conversational question answering (ConvQA) is a convenient means of searching\nover RDF knowledge graphs (KGs), where a prevalent approach is to translate\nnatural language questions to SPARQL queries. However, SPARQL has certain\nshortcomings: (i) it is brittle for complex intents and conversational\nquestions, and (ii) it is not suitable for more abstract needs. Instead, we\npropose a novel two-pronged system where we fuse: (i) SQL-query results over a\ndatabase automatically derived from the KG, and (ii) text-search results over\nverbalizations of KG facts. Our pipeline supports iterative retrieval: when the\nresults of any branch are found to be unsatisfactory, the system can\nautomatically opt for further rounds. We put everything together in a retrieval\naugmented generation (RAG) setup, where an LLM generates a coherent response\nfrom accumulated search results. 
We demonstrate the superiority of our proposed\nsystem over several baselines on a knowledge graph of BMW automobiles.\n","authors":["Rishiraj Saha Roy","Chris Hinze","Joel Schlotthauer","Farzad Naderi","Viktor Hangya","Andreas Foltyn","Luzian Hahn","Fabian Kuech"],"pdf_url":"https://arxiv.org/pdf/2412.17690v2.pdf","comment":"Accepted at BTW 2025, 10 pages"},{"id":"http://arxiv.org/abs/2412.18241v1","updated":"2024-12-24T07:51:29Z","published":"2024-12-24T07:51:29Z","title":"An Automatic Graph Construction Framework based on Large Language Models\n for Recommendation","summary":" Graph neural networks (GNNs) have emerged as state-of-the-art methods to\nlearn from graph-structured data for recommendation. However, most existing\nGNN-based recommendation methods focus on the optimization of model structures\nand learning strategies based on pre-defined graphs, neglecting the importance\nof the graph construction stage. Earlier works for graph construction usually\nrely on specific rules or crowdsourcing, which are either too simplistic or\ntoo labor-intensive. Recent works start to utilize large language models (LLMs)\nto automate the graph construction, in view of their abundant open-world\nknowledge and remarkable reasoning capabilities. Nevertheless, they generally\nsuffer from two limitations: (1) invisibility of global view (e.g., overlooking\ncontextual information) and (2) construction inefficiency. To this end, we\nintroduce AutoGraph, an automatic graph construction framework based on LLMs\nfor recommendation. Specifically, we first use LLMs to infer the user\npreference and item knowledge, which is encoded as semantic vectors. Next, we\nemploy vector quantization to extract the latent factors from the semantic\nvectors. The latent factors are then incorporated as extra nodes to link the\nuser/item nodes, resulting in a graph with in-depth global-view semantics. 
We\nfurther design metapath-based message aggregation to effectively aggregate the\nsemantic and collaborative information. The framework is model-agnostic and\ncompatible with different backbone models. Extensive experiments on three\nreal-world datasets demonstrate the efficacy and efficiency of AutoGraph\ncompared to existing baseline methods. We have deployed AutoGraph in Huawei\nadvertising platform, and gain a 2.69% improvement on RPM and a 7.31%\nimprovement on eCPM in the online A/B test. Currently AutoGraph has been used\nas the main traffic model, serving hundreds of millions of people.\n","authors":["Rong Shan","Jianghao Lin","Chenxu Zhu","Bo Chen","Menghui Zhu","Kangning Zhang","Jieming Zhu","Ruiming Tang","Yong Yu","Weinan Zhang"],"pdf_url":"https://arxiv.org/pdf/2412.18241v1.pdf","comment":"Under review"},{"id":"http://arxiv.org/abs/2412.18232v1","updated":"2024-12-24T07:30:55Z","published":"2024-12-24T07:30:55Z","title":"Efficient Long Context Language Model Retrieval with Compression","summary":" Long Context Language Models (LCLMs) have emerged as a new paradigm to\nperform Information Retrieval (IR), which enables the direct ingestion and\nretrieval of information by processing an entire corpus in their single\ncontext, showcasing the potential to surpass traditional sparse and dense\nretrieval methods. However, processing a large number of passages within\nin-context for retrieval is computationally expensive, and handling their\nrepresentations during inference further exacerbates the processing time; thus,\nwe aim to make LCLM retrieval more efficient and potentially more effective\nwith passage compression. Specifically, we propose a new compression approach\ntailored for LCLM retrieval, which is trained to maximize the retrieval\nperformance while minimizing the length of the compressed passages. 
To\naccomplish this, we generate the synthetic data, where compressed passages are\nautomatically created and labeled as chosen or rejected according to their\nretrieval success for a given query, and we train the proposed Compression\nmodel for Long context Retrieval (CoLoR) with this data via preference\noptimization while adding the length regularization loss on top of it to\nenforce brevity. Through extensive experiments on 9 datasets, we show that\nCoLoR improves the retrieval performance by 6% while compressing the in-context\nsize by a factor of 1.91.\n","authors":["Minju Seo","Jinheon Baek","Seongyun Lee","Sung Ju Hwang"],"pdf_url":"https://arxiv.org/pdf/2412.18232v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.18176v1","updated":"2024-12-24T05:23:13Z","published":"2024-12-24T05:23:13Z","title":"Molar: Multimodal LLMs with Collaborative Filtering Alignment for\n Enhanced Sequential Recommendation","summary":" Sequential recommendation (SR) systems have evolved significantly over the\npast decade, transitioning from traditional collaborative filtering to deep\nlearning approaches and, more recently, to large language models (LLMs). While\nthe adoption of LLMs has driven substantial advancements, these models\ninherently lack collaborative filtering information, relying primarily on\ntextual content data neglecting other modalities and thus failing to achieve\noptimal recommendation performance. To address this limitation, we propose\nMolar, a Multimodal large language sequential recommendation framework that\nintegrates multiple content modalities with ID information to capture\ncollaborative signals effectively. Molar employs an MLLM to generate unified\nitem representations from both textual and non-textual data, facilitating\ncomprehensive multimodal modeling and enriching item embeddings. 
Additionally,\nit incorporates collaborative filtering signals through a post-alignment\nmechanism, which aligns user representations from content-based and ID-based\nmodels, ensuring precise personalization and robust performance. By seamlessly\ncombining multimodal content with collaborative filtering insights, Molar\ncaptures both user interests and contextual semantics, leading to superior\nrecommendation accuracy. Extensive experiments validate that Molar\nsignificantly outperforms traditional and LLM-based baselines, highlighting its\nstrength in utilizing multimodal data and collaborative signals for sequential\nrecommendation tasks. The source code is available at\nhttps://anonymous.4open.science/r/Molar-8B06/.\n","authors":["Yucong Luo","Qitao Qin","Hao Zhang","Mingyue Cheng","Ruiran Yan","Kefan Wang","Jie Ouyang"],"pdf_url":"https://arxiv.org/pdf/2412.18176v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.18170v1","updated":"2024-12-24T05:07:55Z","published":"2024-12-24T05:07:55Z","title":"Unlocking the Hidden Treasures: Enhancing Recommendations with Unlabeled\n Data","summary":" Collaborative filtering (CF) stands as a cornerstone in recommender systems,\nyet effectively leveraging the massive unlabeled data presents a significant\nchallenge. Current research focuses on addressing the challenge of unlabeled\ndata by extracting a subset that closely approximates negative samples.\nRegrettably, the remaining data are overlooked, failing to fully integrate this\nvaluable information into the construction of user preferences. To address this\ngap, we introduce a novel positive-neutral-negative (PNN) learning paradigm.\nPNN introduces a neutral class, encompassing intricate items that are\nchallenging to categorize directly as positive or negative samples. By training\na model based on this triple-wise partial ranking, PNN offers a promising\nsolution to learning complex user preferences. 
Through theoretical analysis, we\nconnect PNN to one-way partial AUC (OPAUC) to validate its efficacy.\nImplementing the PNN paradigm is, however, technically challenging because: (1)\nit is difficult to classify unlabeled data into neutral or negative in the\nabsence of supervised signals; (2) there does not exist any loss function that\ncan handle set-level triple-wise ranking relationships. To address these\nchallenges, we propose a semi-supervised learning method coupled with a\nuser-aware attention model for knowledge acquisition and classification\nrefinement. Additionally, a novel loss function with a two-step centroid\nranking approach enables handling set-level rankings. Extensive experiments on\nfour real-world datasets demonstrate that, when combined with PNN, a wide range\nof representative CF models can consistently and significantly boost their\nperformance. Even with a simple matrix factorization, PNN can achieve\ncomparable performance to sophisticated graph neural networks.\n","authors":["Yuhan Zhao","Rui Chen","Qilong Han","Hongtao Song","Li Chen"],"pdf_url":"https://arxiv.org/pdf/2412.18170v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.18168v1","updated":"2024-12-24T05:01:16Z","published":"2024-12-24T05:01:16Z","title":"From Pairwise to Ranking: Climbing the Ladder to Ideal Collaborative\n Filtering with Pseudo-Ranking","summary":" Intuitively, an ideal collaborative filtering (CF) model should learn from\nusers' full rankings over all items to make optimal top-K recommendations. Due\nto the absence of such full rankings in practice, most CF models rely on\npairwise loss functions to approximate full rankings, resulting in an immense\nperformance gap. In this paper, we provide a novel analysis using the multiple\nordinal classification concept to reveal the inevitable gap between a pairwise\napproximation and the ideal case. 
However, bridging the gap in practice\nencounters two formidable challenges: (1) none of the real-world datasets\ncontains full ranking information; (2) there does not exist a loss function\nthat is capable of consuming ranking information. To overcome these challenges,\nwe propose a pseudo-ranking paradigm (PRP) that addresses the lack of ranking\ninformation by introducing pseudo-rankings supervised by an original noise\ninjection mechanism. Additionally, we put forward a new ranking loss function\ndesigned to handle ranking information effectively. To ensure our method's\nrobustness against potential inaccuracies in pseudo-rankings, we equip the\nranking loss function with a gradient-based confidence mechanism to detect and\nmitigate abnormal gradients. Extensive experiments on four real-world datasets\ndemonstrate that PRP significantly outperforms state-of-the-art methods.\n","authors":["Yuhan Zhao","Rui Chen","Li Chen","Shuang Zhang","Qilong Han","Hongtao Song"],"pdf_url":"https://arxiv.org/pdf/2412.18168v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2404.00702v3","updated":"2024-12-24T02:48:26Z","published":"2024-03-31T14:41:49Z","title":"LLMTreeRec: Unleashing the Power of Large Language Models for Cold-Start\n Recommendations","summary":" The lack of training data gives rise to the system cold-start problem in\nrecommendation systems, making them struggle to provide effective\nrecommendations. To address this problem, Large Language Models (LLMs) can\nmodel recommendation tasks as language analysis tasks and provide zero-shot\nresults based on their vast open-world knowledge. However, the large scale of\nthe item corpus poses a challenge to LLMs, leading to substantial token\nconsumption that makes it impractical to deploy in real-world recommendation\nsystems. To tackle this challenge, we introduce a tree-based LLM recommendation\nframework LLMTreeRec, which structures all items into an item tree to improve\nthe efficiency of LLM's item retrieval. 
LLMTreeRec achieves state-of-the-art\nperformance under the system cold-start setting in two widely used datasets,\nwhich is even competitive with conventional deep recommendation systems that\nuse substantial training data. Furthermore, LLMTreeRec outperforms the baseline\nmodel in A/B testing on Huawei industrial systems. Consequently, LLMTreeRec\ndemonstrates its effectiveness as an industry-friendly solution that has been\nsuccessfully deployed online. Our code is available at:\nhttps://github.com/Applied-Machine-Learning-Lab/LLMTreeRec.\n","authors":["Wenlin Zhang","Chuhan Wu","Xiangyang Li","Yuhao Wang","Kuicai Dong","Yichao Wang","Xinyi Dai","Xiangyu Zhao","Huifeng Guo","Ruiming Tang"],"pdf_url":"https://arxiv.org/pdf/2404.00702v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.07573v2","updated":"2024-12-24T02:09:03Z","published":"2024-12-10T15:06:48Z","title":"Subtopic-aware View Sampling and Temporal Aggregation for Long-form\n Document Matching","summary":" Long-form document matching aims to judge the relevance between two documents\nand has been applied to various scenarios. Most existing works utilize\nhierarchical or long context models to process documents, which achieve coarse\nunderstanding but may ignore details. Some researchers construct a document\nview with similar sentences about aligned document subtopics to focus on\ndetailed matching signals. However, a long document generally contains multiple\nsubtopics. The matching signals are heterogeneous from multiple topics.\nConsidering only the homologous aligned subtopics may not be representative\nenough and may cause biased modeling. In this paper, we introduce a new\nframework to model representative matching signals. First, we propose to\ncapture various matching signals through subtopics of document pairs. Next, we\nconstruct multiple document views based on subtopics to cover heterogeneous and\nvaluable details. 
However, existing spatial aggregation methods like attention,\nwhich integrate all these views simultaneously, are hard to integrate\nheterogeneous information. Instead, we propose temporal aggregation, which\neffectively integrates different views gradually as the training progresses.\nExperimental results show that our learning framework is effective on several\ndocument-matching tasks, including news duplication and legal case retrieval.\n","authors":["Youchao Zhou","Heyan Huang","Zhijing Wu","Yuhang Liu","Xinglin Wang"],"pdf_url":"https://arxiv.org/pdf/2412.07573v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.18092v1","updated":"2024-12-24T02:07:53Z","published":"2024-12-24T02:07:53Z","title":"BRIDGE: Bundle Recommendation via Instruction-Driven Generation","summary":" Bundle recommendation aims to suggest a set of interconnected items to users.\nHowever, diverse interaction types and sparse interaction matrices often pose\nchallenges for previous approaches in accurately predicting user-bundle\nadoptions. Inspired by the distant supervision strategy and generative\nparadigm, we propose BRIDGE, a novel framework for bundle recommendation. It\nconsists of two main components namely the correlation-based item clustering\nand the pseudo bundle generation modules. Inspired by the distant supervision\napproach, the former is to generate more auxiliary information, e.g.,\ninstructive item clusters, for training without using external data. This\ninformation is subsequently aggregated with collaborative signals from user\nhistorical interactions to create pseudo `ideal' bundles. This capability\nallows BRIDGE to explore all aspects of bundles, rather than being limited to\nexisting real-world bundles. It effectively bridges the gap between user\nimagination and predefined bundles, hence improving the bundle recommendation\nperformance. 
Experimental results validate the superiority of our models over\nstate-of-the-art ranking-based methods across five benchmark datasets.\n","authors":["Tuan-Nghia Bui","Huy-Son Nguyen","Cam-Van Nguyen Thi","Hoang-Quynh Le","Duc-Trong Le"],"pdf_url":"https://arxiv.org/pdf/2412.18092v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.18082v1","updated":"2024-12-24T01:38:19Z","published":"2024-12-24T01:38:19Z","title":"Prompt Tuning for Item Cold-start Recommendation","summary":" The item cold-start problem is crucial for online recommender systems, as the\nsuccess of the cold-start phase determines whether items can transition into\npopular ones. Prompt learning, a powerful technique used in natural language\nprocessing (NLP) to address zero- or few-shot problems, has been adapted for\nrecommender systems to tackle similar challenges. However, existing methods\ntypically rely on content-based properties or text descriptions for prompting,\nwhich we argue may be suboptimal for cold-start recommendations due to 1)\nsemantic gaps with recommender tasks, 2) model bias caused by warm-up items\ncontribute most of the positive feedback to the model, which is the core of the\ncold-start problem that hinders the recommender quality on cold-start items. We\npropose to leverage high-value positive feedback, termed pinnacle feedback as\nprompt information, to simultaneously resolve the above two problems. We\nexperimentally prove that compared to the content description proposed in\nexisting works, the positive feedback is more suitable to serve as prompt\ninformation by bridging the semantic gaps. Besides, we propose item-wise\npersonalized prompt networks to encode pinnaclce feedback to relieve the model\nbias by the positive feedback dominance problem. Extensive experiments on four\nreal-world datasets demonstrate the superiority of our model over\nstate-of-the-art methods. 
Moreover, PROMO has been successfully deployed on a\npopular short-video sharing platform, a billion-user scale commercial\nshort-video application, achieving remarkable performance gains across various\ncommercial metrics within cold-start scenarios\n","authors":["Yuezihan Jiang","Gaode Chen","Wenhan Zhang","Jingchi Wang","Yinjie Jiang","Qi Zhang","Jingjian Lin","Peng Jiang","Kaigui Bian"],"pdf_url":"https://arxiv.org/pdf/2412.18082v1.pdf","comment":null}],"Machine Learning":[{"id":"http://arxiv.org/abs/2412.18601v1","updated":"2024-12-24T18:56:00Z","published":"2024-12-24T18:56:00Z","title":"Decentralized Intelligence in GameFi: Embodied AI Agents and the\n Convergence of DeFi and Virtual Ecosystems","summary":" In the rapidly evolving landscape of GameFi, a fusion of gaming and\ndecentralized finance (DeFi), there exists a critical need to enhance player\nengagement and economic interaction within gaming ecosystems. Our GameFi\necosystem aims to fundamentally transform this landscape by integrating\nadvanced embodied AI agents into GameFi platforms. These AI agents, developed\nusing cutting-edge large language models (LLMs), such as GPT-4 and Claude AI,\nare capable of proactive, adaptive, and contextually rich interactions with\nplayers. By going beyond traditional scripted responses, these agents become\nintegral participants in the game's narrative and economic systems, directly\ninfluencing player strategies and in-game economies. We address the limitations\nof current GameFi platforms, which often lack immersive AI interactions and\nmechanisms for community engagement or creator monetization. Through the deep\nintegration of AI agents with blockchain technology, we establish a\nconsensus-driven, decentralized GameFi ecosystem. This ecosystem empowers\ncreators to monetize their contributions and fosters democratic collaboration\namong players and creators. 
Furthermore, by embedding DeFi mechanisms into the\ngaming experience, we enhance economic participation and provide new\nopportunities for financial interactions within the game. Our approach enhances\nplayer immersion and retention and advances the GameFi ecosystem by bridging\ntraditional gaming with Web3 technologies. By integrating sophisticated AI and\nDeFi elements, we contribute to the development of more engaging, economically\nrobust, and community-centric gaming environments. This project represents a\nsignificant advancement in the state-of-the-art in GameFi, offering insights\nand methodologies that can be applied throughout the gaming industry.\n","authors":["Fernando Jia","Jade Zheng","Florence Li"],"pdf_url":"https://arxiv.org/pdf/2412.18601v1.pdf","comment":"11 pages, 4 figures"},{"id":"http://arxiv.org/abs/2412.18594v1","updated":"2024-12-24T18:49:13Z","published":"2024-12-24T18:49:13Z","title":"Structure Learning in Gaussian Graphical Models from Glauber Dynamics","summary":" Gaussian graphical model selection is an important paradigm with numerous\napplications, including biological network modeling, financial network\nmodeling, and social network analysis. Traditional approaches assume access to\nindependent and identically distributed (i.i.d) samples, which is often\nimpractical in real-world scenarios. In this paper, we address Gaussian\ngraphical model selection under observations from a more realistic dependent\nstochastic process known as Glauber dynamics. Glauber dynamics, also called the\nGibbs sampler, is a Markov chain that sequentially updates the variables of the\nunderlying model based on the statistics of the remaining model. 
Such models,\naside from frequently being employed to generate samples from complex\nmultivariate distributions, naturally arise in various settings, such as\nopinion consensus in social networks and clearing/stock-price dynamics in\nfinancial networks.\n In contrast to the extensive body of existing work, we present the first\nalgorithm for Gaussian graphical model selection when data are sampled\naccording to the Glauber dynamics. We provide theoretical guarantees on the\ncomputational and statistical complexity of the proposed algorithm's structure\nlearning performance. Additionally, we provide information-theoretic lower\nbounds on the statistical complexity and show that our algorithm is nearly\nminimax optimal for a broad class of problems.\n","authors":["Vignesh Tirukkonda","Anirudh Rayas","Gautam Dasarathy"],"pdf_url":"https://arxiv.org/pdf/2412.18594v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.18584v1","updated":"2024-12-24T18:25:50Z","published":"2024-12-24T18:25:50Z","title":"Resolution-Robust 3D MRI Reconstruction with 2D Diffusion Priors:\n Diverse-Resolution Training Outperforms Interpolation","summary":" Deep learning-based 3D imaging, in particular magnetic resonance imaging\n(MRI), is challenging because of limited availability of 3D training data.\nTherefore, 2D diffusion models trained on 2D slices are starting to be\nleveraged for 3D MRI reconstruction. However, as we show in this paper,\nexisting methods pertain to a fixed voxel size, and performance degrades when\nthe voxel size is varied, as it is often the case in clinical practice. In this\npaper, we propose and study several approaches for resolution-robust 3D MRI\nreconstruction with 2D diffusion priors. As a result of this investigation, we\nobtain a simple resolution-robust variational 3D reconstruction approach based\non diffusion-guided regularization of randomly sampled 2D slices. 
This method\nprovides competitive reconstruction quality compared to posterior sampling\nbaselines. Towards resolving the sensitivity to resolution-shifts, we\ninvestigate state-of-the-art model-based approaches including Gaussian\nsplatting, neural representations, and infinite-dimensional diffusion models,\nas well as a simple data-centric approach of training the diffusion model on\nseveral resolutions. Our experiments demonstrate that the model-based\napproaches fail to close the performance gap in 3D MRI. In contrast, the\ndata-centric approach of training the diffusion model on various resolutions\neffectively provides a resolution-robust method without compromising accuracy.\n","authors":["Anselm Krainovic","Stefan Ruschke","Reinhard Heckel"],"pdf_url":"https://arxiv.org/pdf/2412.18584v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.18582v1","updated":"2024-12-24T18:18:52Z","published":"2024-12-24T18:18:52Z","title":"Exploring Embedding Priors in Prompt-Tuning for Improved\n Interpretability and Control","summary":" Prompt-Tuning is an efficient method for adapting pre-trained language models\nto new tasks with minimal computational overhead by modifying prompt\nembeddings. In this work, we investigate how crucial the phenomenon of\nembedding collapse, frequently observed in Prompt-Tuning, is for the final\nperformance of the model. To address this question, we designed embedding\npriors and compared them with posteriors of the converged Soft and Deep\nPrompt-Tuning methods. Our findings suggest that priors strongly affect the\nposition of the tuned embeddings, and models can effectively work with\nembeddings from different parts of activation spaces, including completely new\nregions. As the final Prompt-Tuning capabilities are limited, we hypothesize\nthat controllable Prompt-Tuning posteriors may serve as a good starting point\nfor tasks such as chain-of-thought (COT) distillation. 
Our experiments also\nshow that generated trajectories are not localized in the activation space of\nthe models. However, there are distinct clusters of activations for distant\ntasks (e.g., NLP and arithmetic), while activations between NLP tasks (e.g.,\nQuestion-Answering and MLM) lie in the same cluster. These observations raise\nquestions about the importance of a single activation cluster for the\ngeneralization abilities of large language models.\n","authors":["Sergey Sedov","Sumanth Bharadwaj Hachalli Karanam","Venu Gopal Kadamba"],"pdf_url":"https://arxiv.org/pdf/2412.18582v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.18579v1","updated":"2024-12-24T18:11:01Z","published":"2024-12-24T18:11:01Z","title":"ReducedLUT: Table Decomposition with \"Don't Care\" Conditions","summary":" Lookup tables (LUTs) are frequently used to efficiently store arrays of\nprecomputed values for complex mathematical computations. When used in the\ncontext of neural networks, these functions exhibit a lack of recognizable\npatterns which presents an unusual challenge for conventional logic synthesis\ntechniques. Several approaches are known to break down a single large lookup\ntable into multiple smaller ones that can be recombined. Traditional methods,\nsuch as plain tabulation, piecewise linear approximation, and multipartite\ntable methods, often yield inefficient hardware solutions when applied to\nLUT-based NNs.\n This paper introduces ReducedLUT, a novel method to reduce the footprint of\nthe LUTs by injecting don't cares into the compression process. This additional\nfreedom introduces more self-similarities which can be exploited using known\ndecomposition techniques. We then demonstrate a particular application to\nmachine learning; by replacing unobserved patterns within the training data of\nneural network models with don't cares, we enable greater compression with\nminimal model accuracy degradation. 
In practice, we achieve up to $1.63\\times$\nreduction in Physical LUT utilization, with a test accuracy drop of no more\nthan $0.01$ accuracy points.\n","authors":["Oliver Cassidy","Marta Andronic","Samuel Coward","George A. Constantinides"],"pdf_url":"https://arxiv.org/pdf/2412.18579v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2407.10854v2","updated":"2024-12-24T18:05:03Z","published":"2024-07-15T16:06:20Z","title":"Principal Component Flow Map Learning of PDEs from Incomplete, Limited,\n and Noisy Data","summary":" We present a computational technique for modeling the evolution of dynamical\nsystems in a reduced basis, with a focus on the challenging problem of modeling\npartially-observed partial differential equations (PDEs) on high-dimensional\nnon-uniform grids. We address limitations of previous work on data-driven flow\nmap learning in the sense that we focus on noisy and limited data to move\ntoward data collection scenarios in real-world applications. Leveraging recent\nwork on modeling PDEs in modal and nodal spaces, we present a neural network\nstructure that is suitable for PDE modeling with noisy and limited data\navailable only on a subset of the state variables or computational domain. In\nparticular, spatial grid-point measurements are reduced using a learned linear\ntransformation, after which the dynamics are learned in this reduced basis\nbefore being transformed back out to the nodal space. This approach yields a\ndrastically reduced parameterization of the neural network compared with\nprevious flow map models for nodal space learning. 
This allows for rapid\nhigh-resolution simulations, enabled by smaller training data sets and reduced\ntraining times.\n","authors":["Victor Churchill"],"pdf_url":"https://arxiv.org/pdf/2407.10854v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.18571v1","updated":"2024-12-24T17:51:42Z","published":"2024-12-24T17:51:42Z","title":"Scalable Quantum-Inspired Optimization through Dynamic Qubit Compression","summary":" Hard combinatorial optimization problems, often mapped to Ising models,\npromise potential solutions with quantum advantage but are constrained by\nlimited qubit counts in near-term devices. We present an innovative\nquantum-inspired framework that dynamically compresses large Ising models to\nfit available quantum hardware of different sizes. Thus, we aim to bridge the\ngap between large-scale optimization and current hardware capabilities. Our\nmethod leverages a physics-inspired GNN architecture to capture complex\ninteractions in Ising models and accurately predict alignments among\nneighboring spins (aka qubits) at ground states. By progressively merging such\naligned spins, we can reduce the model size while preserving the underlying\noptimization structure. It also provides a natural trade-off between the\nsolution quality and size reduction, meeting different hardware constraints of\nquantum computing devices. 
Extensive numerical studies on Ising instances of\ndiverse topologies show that our method can reduce instance size at multiple\nlevels with virtually no losses in solution quality on the latest D-wave\nquantum annealers.\n","authors":["Co Tran","Quoc-Bao Tran","Hy Truong Son","Thang N Dinh"],"pdf_url":"https://arxiv.org/pdf/2412.18571v1.pdf","comment":"Accepted to AAAI'25"},{"id":"http://arxiv.org/abs/2412.18568v1","updated":"2024-12-24T17:41:41Z","published":"2024-12-24T17:41:41Z","title":"HNCI: High-Dimensional Network Causal Inference","summary":" The problem of evaluating the effectiveness of a treatment or policy commonly\nappears in causal inference applications under network interference. In this\npaper, we suggest the new method of high-dimensional network causal inference\n(HNCI) that provides both valid confidence interval on the average direct\ntreatment effect on the treated (ADET) and valid confidence set for the\nneighborhood size for interference effect. We exploit the model setting in\nBelloni et al. (2022) and allow certain type of heterogeneity in node\ninterference neighborhood sizes. We propose a linear regression formulation of\npotential outcomes, where the regression coefficients correspond to the\nunderlying true interference function values of nodes and exhibit a latent\nhomogeneous structure. Such a formulation allows us to leverage existing\nliterature from linear regression and homogeneity pursuit to conduct valid\nstatistical inferences with theoretical guarantees. The resulting confidence\nintervals for the ADET are formally justified through asymptotic normalities\nwith estimable variances. We further provide the confidence set for the\nneighborhood size with theoretical guarantees exploiting the repro samples\napproach. 
The practical utilities of the newly suggested methods are\ndemonstrated through simulation and real data examples.\n","authors":["Wenqin Du","Rundong Ding","Yingying Fan","Jinchi Lv"],"pdf_url":"https://arxiv.org/pdf/2412.18568v1.pdf","comment":"89 pages, 7 figures"},{"id":"http://arxiv.org/abs/2412.18564v1","updated":"2024-12-24T17:36:27Z","published":"2024-12-24T17:36:27Z","title":"Efficient Aircraft Design Optimization Using Multi-Fidelity Models and\n Multi-fidelity Physics Informed Neural Networks","summary":" Aircraft design optimization traditionally relies on computationally\nexpensive simulation techniques such as Finite Element Method (FEM) and Finite\nVolume Method (FVM), which, while accurate, can significantly slow down the\ndesign iteration process. The challenge lies in reducing the computational\ncomplexity while maintaining high accuracy for quick evaluations of multiple\ndesign alternatives. This research explores advanced methods, including\nsurrogate models, reduced-order models (ROM), and multi-fidelity machine\nlearning techniques, to achieve more efficient aircraft design evaluations.\nSpecifically, the study investigates the application of Multi-fidelity\nPhysics-Informed Neural Networks (MPINN) and autoencoders for manifold\nalignment, alongside the potential of Generative Adversarial Networks (GANs)\nfor refining design geometries. 
Through a proof-of-concept task, the research\ndemonstrates the ability to predict high-fidelity results from low-fidelity\nsimulations, offering a path toward faster and more cost effective aircraft\ndesign iterations.\n","authors":["Apurba Sarker"],"pdf_url":"https://arxiv.org/pdf/2412.18564v1.pdf","comment":"7 pages, 3 figures"},{"id":"http://arxiv.org/abs/2412.18557v1","updated":"2024-12-24T17:20:43Z","published":"2024-12-24T17:20:43Z","title":"FedVCK: Non-IID Robust and Communication-Efficient Federated Learning\n via Valuable Condensed Knowledge for Medical Image Analysis","summary":" Federated learning has become a promising solution for collaboration among\nmedical institutions. However, data owned by each institution would be highly\nheterogeneous and the distribution is always non-independent and identical\ndistribution (non-IID), resulting in client drift and unsatisfactory\nperformance. Despite existing federated learning methods attempting to solve\nthe non-IID problems, they still show marginal advantages but rely on frequent\ncommunication which would incur high costs and privacy concerns. In this paper,\nwe propose a novel federated learning method: \\textbf{Fed}erated learning via\n\\textbf{V}aluable \\textbf{C}ondensed \\textbf{K}nowledge (FedVCK). We enhance\nthe quality of condensed knowledge and select the most necessary knowledge\nguided by models, to tackle the non-IID problem within limited communication\nbudgets effectively. Specifically, on the client side, we condense the\nknowledge of each client into a small dataset and further enhance the\ncondensation procedure with latent distribution constraints, facilitating the\neffective capture of high-quality knowledge. During each round, we specifically\ntarget and condense knowledge that has not been assimilated by the current\nmodel, thereby preventing unnecessary repetition of homogeneous knowledge and\nminimizing the frequency of communications required. 
On the server side, we\npropose relational supervised contrastive learning to provide more supervision\nsignals to aid the global model updating. Comprehensive experiments across\nvarious medical tasks show that FedVCK can outperform state-of-the-art methods,\ndemonstrating that it's non-IID robust and communication-efficient.\n","authors":["Guochen Yan","Luyuan Xie","Xinyi Gao","Wentao Zhang","Qingni Shen","Yuejian Fang","Zhonghai Wu"],"pdf_url":"https://arxiv.org/pdf/2412.18557v1.pdf","comment":"Accepted by AAAI 2025"},{"id":"http://arxiv.org/abs/2412.18547v1","updated":"2024-12-24T16:55:45Z","published":"2024-12-24T16:55:45Z","title":"Token-Budget-Aware LLM Reasoning","summary":" Reasoning is critical for large language models (LLMs) to excel in a wide\nrange of tasks. While methods like Chain-of-Thought (CoT) reasoning enhance LLM\nperformance by decomposing problems into intermediate steps, they also incur\nsignificant overhead in token usage, leading to increased costs. We find that\nthe reasoning process of current LLMs is unnecessarily lengthy and it can be\ncompressed by including a reasonable token budget in the prompt, but the choice\nof token budget plays a crucial role in the actual compression effectiveness.\nWe then propose a token-budget-aware LLM reasoning framework, which dynamically\nestimates token budgets for different problems based on reasoning complexity\nand uses the estimated token budgets to guide the reasoning process.\nExperiments show that our method effectively reduces token costs in CoT\nreasoning with only a slight performance reduction, offering a practical\nsolution to balance efficiency and accuracy in LLM reasoning. 
Code:\nhttps://github.com/GeniusHTX/TALE.\n","authors":["Tingxu Han","Chunrong Fang","Shiyu Zhao","Shiqing Ma","Zhenyu Chen","Zhenting Wang"],"pdf_url":"https://arxiv.org/pdf/2412.18547v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.18544v1","updated":"2024-12-24T16:51:35Z","published":"2024-12-24T16:51:35Z","title":"Consistency Checks for Language Model Forecasters","summary":" Forecasting is a task that is difficult to evaluate: the ground truth can\nonly be known in the future. Recent work showing LLM forecasters rapidly\napproaching human-level performance begs the question: how can we benchmark and\nevaluate these forecasters instantaneously? Following the consistency check\nframework, we measure the performance of forecasters in terms of the\nconsistency of their predictions on different logically-related questions. We\npropose a new, general consistency metric based on arbitrage: for example, if a\nforecasting AI illogically predicts that both the Democratic and Republican\nparties have 60% probability of winning the 2024 US presidential election, an\narbitrageur can trade against the forecaster's predictions and make a profit.\nWe build an automated evaluation system that generates a set of base questions,\ninstantiates consistency checks from these questions, elicits the predictions\nof the forecaster, and measures the consistency of the predictions. We then\nbuild a standard, proper-scoring-rule forecasting benchmark, and show that our\n(instantaneous) consistency metrics correlate with LLM forecasters' ground\ntruth Brier scores (which are only known in the future). We also release a\nconsistency benchmark that resolves in 2028, providing a long-term evaluation\ntool for forecasting.\n","authors":["Daniel Paleka","Abhimanyu Pallavi Sudhir","Alejandro Alvarez","Vineeth Bhat","Adam Shen","Evan Wang","Florian Tramèr"],"pdf_url":"https://arxiv.org/pdf/2412.18544v1.pdf","comment":"56 pages, 25 figures. 
Submitted to ICLR 2025"},{"id":"http://arxiv.org/abs/2412.18539v1","updated":"2024-12-24T16:42:45Z","published":"2024-12-24T16:42:45Z","title":"Convergence of Statistical Estimators via Mutual Information Bounds","summary":" Recent advances in statistical learning theory have revealed profound\nconnections between mutual information (MI) bounds, PAC-Bayesian theory, and\nBayesian nonparametrics. This work introduces a novel mutual information bound\nfor statistical models. The derived bound has wide-ranging applications in\nstatistical inference. It yields improved contraction rates for fractional\nposteriors in Bayesian nonparametrics. It can also be used to study a wide\nrange of estimation methods, such as variational inference or Maximum\nLikelihood Estimation (MLE). By bridging these diverse areas, this work\nadvances our understanding of the fundamental limits of statistical inference\nand the role of information in learning from data. We hope that these results\nwill not only clarify connections between statistical inference and information\ntheory but also help to develop a new toolbox to study a wide range of\nestimators.\n","authors":["El Mahdi Khribch","Pierre Alquier"],"pdf_url":"https://arxiv.org/pdf/2412.18539v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.18535v1","updated":"2024-12-24T16:34:50Z","published":"2024-12-24T16:34:50Z","title":"Graph Structure Learning for Spatial-Temporal Imputation: Adapting to\n Node and Feature Scales","summary":" Spatial-temporal data collected across different geographic locations often\nsuffer from missing values, posing challenges to data analysis. Existing\nmethods primarily leverage fixed spatial graphs to impute missing values, which\nimplicitly assume that the spatial relationship is roughly the same for all\nfeatures across different locations. However, they may overlook the different\nspatial relationships of diverse features recorded by sensors in different\nlocations. 
To address this, we introduce the multi-scale Graph Structure\nLearning framework for spatial-temporal Imputation (GSLI) that dynamically\nadapts to the heterogeneous spatial correlations. Our framework encompasses\nnode-scale graph structure learning to cater to the distinct global spatial\ncorrelations of different features, and feature-scale graph structure learning\nto unveil common spatial correlation across features within all stations.\nIntegrated with prominence modeling, our framework emphasizes nodes and\nfeatures with greater significance in the imputation process. Furthermore, GSLI\nincorporates cross-feature and cross-temporal representation learning to\ncapture spatial-temporal dependencies. Evaluated on six real incomplete\nspatial-temporal datasets, GSLI showcases the improvement in data imputation.\n","authors":["Xinyu Yang","Yu Sun","Xinyang Chen","Ying Zhang","Xiaojie Yuan"],"pdf_url":"https://arxiv.org/pdf/2412.18535v1.pdf","comment":"This paper has been accepted as a full paper at AAAI 2025"},{"id":"http://arxiv.org/abs/2408.14234v2","updated":"2024-12-24T16:27:42Z","published":"2024-08-26T12:49:41Z","title":"FSDEM: Feature Selection Dynamic Evaluation Metric","summary":" Expressive evaluation metrics are indispensable for informative experiments\nin all areas, and while several metrics are established in some areas, in\nothers, such as feature selection, only indirect or otherwise limited\nevaluation metrics are found. In this paper, we propose a novel evaluation\nmetric to address several problems of its predecessors and allow for flexible\nand reliable evaluation of feature selection algorithms. The proposed metric is\na dynamic metric with two properties that can be used to evaluate both the\nperformance and the stability of a feature selection algorithm. We conduct\nseveral empirical experiments to illustrate the use of the proposed metric in\nthe successful evaluation of feature selection algorithms. 
We also provide a\ncomparison and analysis to show the different aspects involved in the\nevaluation of the feature selection algorithms. The results indicate that the\nproposed metric is successful in carrying out the evaluation task for feature\nselection algorithms.\n This paper is an extended version of a paper published at SISAP 2024.\n","authors":["Muhammad Rajabinasab","Anton D. Lautrup","Tobias Hyrup","Arthur Zimek"],"pdf_url":"https://arxiv.org/pdf/2408.14234v2.pdf","comment":"Short version of this paper is published at 17th International\n Conference on Similarity Search and Applications, SISAP 2024"},{"id":"http://arxiv.org/abs/2412.18534v1","updated":"2024-12-24T16:27:19Z","published":"2024-12-24T16:27:19Z","title":"GCN-ABFT: Low-Cost Online Error Checking for Graph Convolutional\n Networks","summary":" Graph convolutional networks (GCNs) are popular for building machine-learning\napplication for graph-structured data. This widespread adoption led to the\ndevelopment of specialized GCN hardware accelerators. In this work, we address\na key architectural challenge for GCN accelerators: how to detect errors in GCN\ncomputations arising from random hardware faults with the least computation\ncost. Each GCN layer performs a graph convolution, mathematically equivalent to\nmultiplying three matrices, computed through two separate matrix\nmultiplications. Existing Algorithm-based Fault Tolerance(ABFT) techniques can\ncheck the results of individual matrix multiplications. However, for a GCN\nlayer, this check should be performed twice. To avoid this overhead, this work\nintroduces GCN-ABFT that directly calculates a checksum for the entire\nthree-matrix product within a single GCN layer, providing a cost-effective\napproach for error detection in GCN accelerators. Experimental results\ndemonstrate that GCN-ABFT reduces the number of operations needed for checksum\ncomputation by over 21% on average for representative GCN applications. 
These\nsavings are achieved without sacrificing fault-detection accuracy, as evidenced\nby the presented fault-injection analysis.\n","authors":["Christodoulos Peltekis","Giorgos Dimitrakopoulos"],"pdf_url":"https://arxiv.org/pdf/2412.18534v1.pdf","comment":"Accepted for publication at IEEE Transactions on Computer-Aided\n Design of Integrated Circuits and Systems (TCAD)"},{"id":"http://arxiv.org/abs/2408.14909v2","updated":"2024-12-24T16:25:27Z","published":"2024-08-27T09:35:49Z","title":"SpikingSSMs: Learning Long Sequences with Sparse and Parallel Spiking\n State Space Models","summary":" Known as low energy consumption networks, spiking neural networks (SNNs) have\ngained a lot of attention within the past decades. While SNNs are increasing\ncompetitive with artificial neural networks (ANNs) for vision tasks, they are\nrarely used for long sequence tasks, despite their intrinsic temporal dynamics.\nIn this work, we develop spiking state space models (SpikingSSMs) for long\nsequence learning by leveraging on the sequence learning abilities of state\nspace models (SSMs). Inspired by dendritic neuron structure, we hierarchically\nintegrate neuronal dynamics with the original SSM block, meanwhile realizing\nsparse synaptic computation. Furthermore, to solve the conflict of event-driven\nneuronal dynamics with parallel computing, we propose a light-weight surrogate\ndynamic network which accurately predicts the after-reset membrane potential\nand compatible to learnable thresholds, enabling orders of acceleration in\ntraining speed compared with conventional iterative methods. 
On the long range\narena benchmark task, SpikingSSM achieves competitive performance to\nstate-of-the-art SSMs meanwhile realizing on average 90\\% of network sparsity.\nOn language modeling, our network significantly surpasses existing spiking\nlarge language models (spikingLLMs) on the WikiText-103 dataset with only a\nthird of the model size, demonstrating its potential as backbone architecture\nfor low computation cost LLMs.\n","authors":["Shuaijie Shen","Chao Wang","Renzhuo Huang","Yan Zhong","Qinghai Guo","Zhichao Lu","Jianguo Zhang","Luziwei Leng"],"pdf_url":"https://arxiv.org/pdf/2408.14909v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.18530v1","updated":"2024-12-24T16:24:43Z","published":"2024-12-24T16:24:43Z","title":"Characterizations of Language Generation With Breadth","summary":" We study language generation in the limit, introduced by Kleinberg and\nMullainathan [KM24], building on classical works of Gold [Gol67] and Angluin\n[Ang79]. [KM24] proposed an algorithm that generates strings from any countable\nlanguage collection in the limit. While their algorithm eventually outputs\nstrings from the target language $K$, it sacrifices breadth, i.e., the ability\nto generate all strings in $K$. A key open question in [KM24] is whether this\ntrade-off between consistency and breadth is inherrent.\n Recent works proposed different notions of consistent generation with\nbreadth. Kalavasis, Mehrotra, and Velegkas [KVM24] introduced three\ndefinitions: generation with exact breadth, approximate breadth, and\nunambiguous generation. Concurrently and independently, Charikar and Pabbaraju\n[CP24a] proposed exhaustive generation. Both works examined when generation\nwith these notions of breadth is possible.\n Building on [CP24a, KVM24], we fully characterize language generation for\nthese notions and their natural combinations. 
For exact breadth, we provide an\nunconditional lower bound, removing a technical condition from [KVM24] and\nextending the result of [CP24a] that holds for specific collections of\nlanguages. We show that generation with exact breadth is characterized by\nAngluin's condition for identification. We further introduce a weaker version\nof Angluin's condition that tightly characterizes both approximate breadth and\nexhaustive generation, proving their equivalence. Additionally, we show that\nunambiguous generation is also characterized by Angluin's condition as a\nspecial case of a broader result. Finally, we strengthen [KVM24] by giving\nunconditional lower bounds for stable generators, showing that Angluin's\ncondition characterizes the previous breadth notions for stable generators.\nThis shows a separation between stable and unstable generation with approximate\nbreadth.\n","authors":["Alkis Kalavasis","Anay Mehrotra","Grigoris Velegkas"],"pdf_url":"https://arxiv.org/pdf/2412.18530v1.pdf","comment":"Abstract shortened to fix arXiv limit"},{"id":"http://arxiv.org/abs/2412.18529v1","updated":"2024-12-24T16:24:29Z","published":"2024-12-24T16:24:29Z","title":"Accelerating process control and optimization via machine learning: A\n review","summary":" Process control and optimization have been widely used to solve\ndecision-making problems in chemical engineering applications. However,\nidentifying and tuning the best solution algorithm is challenging and\ntime-consuming. Machine learning tools can be used to automate these steps by\nlearning the behavior of a numerical solver from data. In this paper, we\ndiscuss recent advances in (i) the representation of decision-making problems\nfor machine learning tasks, (ii) algorithm selection, and (iii) algorithm\nconfiguration for monolithic and decomposition-based algorithms. 
Finally, we\ndiscuss open problems related to the application of machine learning for\naccelerating process optimization and control.\n","authors":["Ilias Mitrai","Prodromos Daoutidis"],"pdf_url":"https://arxiv.org/pdf/2412.18529v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.18518v1","updated":"2024-12-24T15:55:30Z","published":"2024-12-24T15:55:30Z","title":"Bayesian Optimization of Bilevel Problems","summary":" Bilevel optimization, a hierarchical mathematical framework where one\noptimization problem is nested within another, has emerged as a powerful tool\nfor modeling complex decision-making processes in various fields such as\neconomics, engineering, and machine learning. This paper focuses on bilevel\noptimization where both upper-level and lower-level functions are black boxes\nand expensive to evaluate. We propose a Bayesian Optimization framework that\nmodels the upper and lower-level functions as Gaussian processes over the\ncombined space of upper and lower-level decisions, allowing us to exploit\nknowledge transfer between different sub-problems. Additionally, we propose a\nnovel acquisition function for this model. Our experimental results demonstrate\nthat the proposed algorithm is highly sample-efficient and outperforms existing\nmethods in finding high-quality solutions.\n","authors":["Omer Ekmekcioglu","Nursen Aydin","Juergen Branke"],"pdf_url":"https://arxiv.org/pdf/2412.18518v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2410.23569v2","updated":"2024-12-24T15:53:18Z","published":"2024-10-31T02:25:43Z","title":"RA-RLHF: Provably Efficient Risk-Aware Reinforcement Learning Human\n Feedback","summary":" Reinforcement Learning Human Feedback (RLHF) studies the problem where agents\nreceive only preferences over pairs of trajectories in each episode.\nTraditional approaches in this field have predominantly focused on the mean\nreward or utility criterion. 
However, in RLHF scenarios demanding heightened\nrisk awareness, such as in AI systems, healthcare, and agriculture, risk-aware\nmeasures are requisite. Traditional risk-aware objectives and algorithms are\nnot applicable in such one-episode-reward settings. To address this, we explore\nand prove the applicability of two risk-aware objectives to RLHF: nested and\nstatic quantile risk objectives. We also introduce Risk-Aware-RLHF (RA-RLHF),\nan algorithm designed to optimize both nested and static objectives.\nAdditionally, we provide a theoretical analysis of the regret upper bounds,\ndemonstrating that they are sublinear with respect to the number of episodes,\nand present empirical results to support our findings. Our code is available in\nhttps://github.com/aguilarjose11/pbrlNeurips.\n","authors":["Yujie Zhao","Jose Efraim Aguilar Escamill","Weyl Lu","Huazheng Wang"],"pdf_url":"https://arxiv.org/pdf/2410.23569v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.18515v1","updated":"2024-12-24T15:52:51Z","published":"2024-12-24T15:52:51Z","title":"Subsampling, aligning, and averaging to find circular coordinates in\n recurrent time series","summary":" We introduce a new algorithm for finding robust circular coordinates on data\nthat is expected to exhibit recurrence, such as that which appears in neuronal\nrecordings of C. elegans. Techniques exist to create circular coordinates on a\nsimplicial complex from a dimension 1 cohomology class, and these can be\napplied to the Rips complex of a dataset when it has a prominent class in its\ndimension 1 cohomology. However, it is known this approach is extremely\nsensitive to uneven sampling density.\n Our algorithm comes with a new method to correct for uneven sampling density,\nadapting our prior work on averaging coordinates in manifold learning. We use\nrejection sampling to correct for inhomogeneous sampling and then apply\nProcrustes matching to align and average the subsamples. 
In addition to\nproviding a more robust coordinate than other approaches, this subsampling and\naveraging approach has better efficiency.\n We validate our technique on both synthetic data sets and neuronal activity\nrecordings. Our results reveal a topological model of neuronal trajectories for\nC. elegans that is constructed from loops in which different regions of the\nbrain state space can be mapped to specific and interpretable macroscopic\nbehaviors in the worm.\n","authors":["Andrew J. Blumberg","Mathieu Carrière","Jun Hou Fung","Michael A. Mandell"],"pdf_url":"https://arxiv.org/pdf/2412.18515v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.18513v1","updated":"2024-12-24T15:52:21Z","published":"2024-12-24T15:52:21Z","title":"FedGIG: Graph Inversion from Gradient in Federated Learning","summary":" Recent studies have shown that Federated learning (FL) is vulnerable to\nGradient Inversion Attacks (GIA), which can recover private training data from\nshared gradients. However, existing methods are designed for dense, continuous\ndata such as images or vectorized texts, and cannot be directly applied to\nsparse and discrete graph data. This paper first explores GIA's impact on\nFederated Graph Learning (FGL) and introduces Graph Inversion from Gradient in\nFederated Learning (FedGIG), a novel GIA method specifically designed for\ngraph-structured data. FedGIG includes the adjacency matrix constraining\nmodule, which ensures the sparsity and discreteness of the reconstructed graph\ndata, and the subgraph reconstruction module, which is designed to complete\nmissing common subgraph structures. 
Extensive experiments on molecular datasets\ndemonstrate FedGIG's superior accuracy over existing GIA techniques.\n","authors":["Tianzhe Xiao","Yichen Li","Yining Qi","Haozhao Wang","Ruixuan Li"],"pdf_url":"https://arxiv.org/pdf/2412.18513v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.18507v1","updated":"2024-12-24T15:47:25Z","published":"2024-12-24T15:47:25Z","title":"An Empirical Analysis of Federated Learning Models Subject to\n Label-Flipping Adversarial Attack","summary":" In this paper, we empirically analyze adversarial attacks on selected\nfederated learning models. The specific learning models considered are\nMultinominal Logistic Regression (MLR), Support Vector Classifier (SVC),\nMultilayer Perceptron (MLP), Convolution Neural Network (CNN), %Recurrent\nNeural Network (RNN), Random Forest, XGBoost, and Long Short-Term Memory\n(LSTM). For each model, we simulate label-flipping attacks, experimenting\nextensively with 10 federated clients and 100 federated clients. We vary the\npercentage of adversarial clients from 10% to 100% and, simultaneously, the\npercentage of labels flipped by each adversarial client is also varied from 10%\nto 100%. Among other results, we find that models differ in their inherent\nrobustness to the two vectors in our label-flipping attack, i.e., the\npercentage of adversarial clients, and the percentage of labels flipped by each\nadversarial client. 
We discuss the potential practical implications of our\nresults.\n","authors":["Kunal Bhatnagar","Sagana Chattanathan","Angela Dang","Bhargav Eranki","Ronnit Rana","Charan Sridhar","Siddharth Vedam","Angie Yao","Mark Stamp"],"pdf_url":"https://arxiv.org/pdf/2412.18507v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.18505v1","updated":"2024-12-24T15:43:04Z","published":"2024-12-24T15:43:04Z","title":"VORTEX: A Spatial Computing Framework for Optimized Drone Telemetry\n Extraction from First-Person View Flight Data","summary":" This paper presents the Visual Optical Recognition Telemetry EXtraction\n(VORTEX) system for extracting and analyzing drone telemetry data from First\nPerson View (FPV) Uncrewed Aerial System (UAS) footage. VORTEX employs MMOCR, a\nPyTorch-based Optical Character Recognition (OCR) toolbox, to extract telemetry\nvariables from drone Heads Up Display (HUD) recordings, utilizing advanced\nimage preprocessing techniques, including CLAHE enhancement and adaptive\nthresholding. The study optimizes spatial accuracy and computational efficiency\nthrough systematic investigation of temporal sampling rates (1s, 5s, 10s, 15s,\n20s) and coordinate processing methods. Results demonstrate that the 5-second\nsampling rate, utilizing 4.07% of available frames, provides the optimal\nbalance with a point retention rate of 64% and mean speed accuracy within 4.2%\nof the 1-second baseline while reducing computational overhead by 80.5%.\nComparative analysis of coordinate processing methods reveals that while UTM\nZone 33N projection and Haversine calculations provide consistently similar\nresults (within 0.1% difference), raw WGS84 coordinates underestimate distances\nby 15-30% and speeds by 20-35%. Altitude measurements showed unexpected\nresilience to sampling rate variations, with only 2.1% variation across all\nintervals. 
This research is the first of its kind, providing quantitative\nbenchmarks for establishing a robust framework for drone telemetry extraction\nand analysis using open-source tools and spatial libraries.\n","authors":["James E. Gallagher","Edward J. Oughton"],"pdf_url":"https://arxiv.org/pdf/2412.18505v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.17737v2","updated":"2024-12-24T15:24:32Z","published":"2024-12-23T17:36:51Z","title":"Contextual Backpropagation Loops: Amplifying Deep Reasoning with\n Iterative Top-Down Feedback","summary":" Deep neural networks typically rely on a single forward pass for inference,\nwhich can limit their capacity to resolve ambiguous inputs. We introduce\nContextual Backpropagation Loops (CBLs) as an iterative mechanism that\nincorporates top-down feedback to refine intermediate representations, thereby\nimproving accuracy and robustness. This repeated process mirrors how humans\ncontinuously re-interpret sensory information in daily life-by checking and\nre-checking our perceptions using contextual cues. Our results suggest that\nCBLs can offer a straightforward yet powerful way to incorporate such\ncontextual reasoning in modern deep learning architectures.\n","authors":["Jacob Fein-Ashley"],"pdf_url":"https://arxiv.org/pdf/2412.17737v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.18489v1","updated":"2024-12-24T15:22:10Z","published":"2024-12-24T15:22:10Z","title":"An Overview and Discussion of the Suitability of Existing Speech\n Datasets to Train Machine Learning Models for Collective Problem Solving","summary":" This report characterized the suitability of existing datasets for devising\nnew Machine Learning models, decision making methods, and analysis algorithms\nto improve Collaborative Problem Solving and then enumerated requirements for\nfuture datasets to be devised. Problem solving was assumed to be performed in\nteams of about three, four members, which talked to each other. 
A dataset\nconsists of the speech recordings of such teams. The characterization\nmethodology was based on metrics that capture cognitive, social, and emotional\nactivities and situations. The report presented the analysis of a large group\nof datasets developed for Spoken Language Understanding, a research area with\nsome similarity to Collaborative Problem Solving.\n","authors":["Gnaneswar Villuri","Alex Doboli"],"pdf_url":"https://arxiv.org/pdf/2412.18489v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2409.02425v2","updated":"2024-12-24T15:15:33Z","published":"2024-09-04T04:12:22Z","title":"Deep Adaptive Interest Network: Personalized Recommendation with\n Context-Aware Learning","summary":" In personalized recommendation systems, accurately capturing users' evolving\ninterests and combining them with contextual information is a critical research\narea. This paper proposes a novel model called the Deep Adaptive Interest\nNetwork (DAIN), which dynamically models users' interests while incorporating\ncontext-aware learning mechanisms to achieve precise and adaptive personalized\nrecommendations. DAIN leverages deep learning techniques to build an adaptive\ninterest network structure that can capture users' interest changes in\nreal-time while further optimizing recommendation results by integrating\ncontextual information. Experiments conducted on several public datasets\ndemonstrate that DAIN excels in both recommendation performance and\ncomputational efficiency. 
This research not only provides a new solution for\npersonalized recommendation systems but also offers fresh insights into the\napplication of context-aware learning in recommendation systems.\n","authors":["Shuaishuai Huang","Haowei Yang","You Yao","Xueting Lin","Yuming Tu"],"pdf_url":"https://arxiv.org/pdf/2409.02425v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.06736v3","updated":"2024-12-24T14:44:32Z","published":"2024-11-11T06:04:53Z","title":"MrSteve: Instruction-Following Agents in Minecraft with What-Where-When\n Memory","summary":" Significant advances have been made in developing general-purpose embodied AI\nin environments like Minecraft through the adoption of LLM-augmented\nhierarchical approaches. While these approaches, which combine high-level\nplanners with low-level controllers, show promise, low-level controllers\nfrequently become performance bottlenecks due to repeated failures. In this\npaper, we argue that the primary cause of failure in many low-level controllers\nis the absence of an episodic memory system. To address this, we introduce\nMrSteve (Memory Recall Steve-1), a novel low-level controller equipped with\nPlace Event Memory (PEM), a form of episodic memory that captures what, where,\nand when information from episodes. This directly addresses the main limitation\nof the popular low-level controller, Steve-1. Unlike previous models that rely\non short-term memory, PEM organizes spatial and event-based data, enabling\nefficient recall and navigation in long-horizon tasks. Additionally, we propose\nan Exploration Strategy and a Memory-Augmented Task Solving Framework, allowing\nagents to alternate between exploration and task-solving based on recalled\nevents. Our approach significantly improves task-solving and exploration\nefficiency compared to existing methods. 
We will release our code and demos on\nthe project page: https://sites.google.com/view/mr-steve.\n","authors":["Junyeong Park","Junmo Cho","Sungjin Ahn"],"pdf_url":"https://arxiv.org/pdf/2411.06736v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.18460v1","updated":"2024-12-24T14:39:47Z","published":"2024-12-24T14:39:47Z","title":"GeFL: Model-Agnostic Federated Learning with Generative Models","summary":" Federated learning (FL) is a promising paradigm in distributed learning while\npreserving the privacy of users. However, the increasing size of recent models\nmakes it unaffordable for a few users to encompass the model. It leads the\nusers to adopt heterogeneous models based on their diverse computing\ncapabilities and network bandwidth. Correspondingly, FL with heterogeneous\nmodels should be addressed, given that FL typically involves training a single\nglobal model. In this paper, we propose Generative Model-Aided Federated\nLearning (GeFL), incorporating a generative model that aggregates global\nknowledge across users of heterogeneous models. Our experiments on various\nclassification tasks demonstrate notable performance improvements of GeFL\ncompared to baselines, as well as limitations in terms of privacy and\nscalability. To tackle these concerns, we introduce a novel framework, GeFL-F.\nIt trains target networks aided by feature-generative models. We empirically\ndemonstrate the consistent performance gains of GeFL-F, while demonstrating\nbetter privacy preservation and robustness to a large number of clients. 
Codes\nare available at [1].\n","authors":["Honggu Kang","Seohyeon Cha","Joonhyuk Kang"],"pdf_url":"https://arxiv.org/pdf/2412.18460v1.pdf","comment":"20 pages"},{"id":"http://arxiv.org/abs/2412.18442v1","updated":"2024-12-24T14:02:44Z","published":"2024-12-24T14:02:44Z","title":"SoK: On the Offensive Potential of AI","summary":" Our society increasingly benefits from Artificial Intelligence (AI).\nUnfortunately, more and more evidence shows that AI is also used for offensive\npurposes. Prior works have revealed various examples of use cases in which the\ndeployment of AI can lead to violation of security and privacy objectives. No\nextant work, however, has been able to draw a holistic picture of the offensive\npotential of AI. In this SoK paper we seek to lay the ground for a systematic\nanalysis of the heterogeneous capabilities of offensive AI. In particular we\n(i) account for AI risks to both humans and systems while (ii) consolidating\nand distilling knowledge from academic literature, expert opinions, industrial\nvenues, as well as laymen -- all of which being valuable sources of information\non offensive AI.\n To enable alignment of such diverse sources of knowledge, we devise a common\nset of criteria reflecting essential technological factors related to offensive\nAI. With the help of such criteria, we systematically analyze: 95 research\npapers; 38 InfoSec briefings (from, e.g., BlackHat); the responses of a user\nstudy (N=549) entailing individuals with diverse backgrounds and expertise; and\nthe opinion of 12 experts. Our contributions not only reveal concerning ways\n(some of which overlooked by prior work) in which AI can be offensively used\ntoday, but also represent a foothold to address this threat in the years to\ncome.\n","authors":["Saskia Laura Schröer","Giovanni Apruzzese","Soheil Human","Pavel Laskov","Hyrum S. Anderson","Edward W. N. 
Bernroider","Aurore Fass","Ben Nassi","Vera Rimmer","Fabio Roli","Samer Salam","Ashley Shen","Ali Sunyaev","Tim Wadwha-Brown","Isabel Wagner","Gang Wang"],"pdf_url":"https://arxiv.org/pdf/2412.18442v1.pdf","comment":"Systemization of Knowledge (SoK) paper"},{"id":"http://arxiv.org/abs/2411.01904v3","updated":"2024-12-24T13:58:21Z","published":"2024-11-04T09:15:21Z","title":"FPPL: An Efficient and Non-IID Robust Federated Continual Learning\n Framework","summary":" Federated continual learning (FCL) aims to learn from sequential data stream\nin the decentralized federated learning setting, while simultaneously\nmitigating the catastrophic forgetting issue in classical continual learning.\nExisting FCL methods usually employ typical rehearsal mechanisms, which could\nresult in privacy violations or additional onerous storage and computational\nburdens. In this work, an efficient and non-IID robust federated continual\nlearning framework, called Federated Prototype-Augmented Prompt Learning\n(FPPL), is proposed. The FPPL can collaboratively learn lightweight prompts\naugmented by prototypes without rehearsal. On the client side, a fusion\nfunction is employed to fully leverage the knowledge contained in task-specific\nprompts for alleviating catastrophic forgetting. Additionally, global\nprototypes aggregated from the server are used to obtain unified representation\nthrough contrastive learning, mitigating the impact of non-IID-derived data\nheterogeneity. On the server side, locally uploaded prototypes are utilized to\nperform debiasing on the classifier, further alleviating the performance\ndegradation caused by both non-IID and catastrophic forgetting. Empirical\nevaluations demonstrate the effectiveness of FPPL, achieving notable\nperformance with an efficient design while remaining robust to diverse non-IID\ndegrees. 
Code is available at: https://github.com/ycheoo/FPPL.\n","authors":["Yuchen He","Chuyun Shen","Xiangfeng Wang","Bo Jin"],"pdf_url":"https://arxiv.org/pdf/2411.01904v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.18437v1","updated":"2024-12-24T13:55:56Z","published":"2024-12-24T13:55:56Z","title":"MixMAS: A Framework for Sampling-Based Mixer Architecture Search for\n Multimodal Fusion and Learning","summary":" Choosing a suitable deep learning architecture for multimodal data fusion is\na challenging task, as it requires the effective integration and processing of\ndiverse data types, each with distinct structures and characteristics. In this\npaper, we introduce MixMAS, a novel framework for sampling-based mixer\narchitecture search tailored to multimodal learning. Our approach automatically\nselects the optimal MLP-based architecture for a given multimodal machine\nlearning (MML) task. Specifically, MixMAS utilizes a sampling-based\nmicro-benchmarking strategy to explore various combinations of\nmodality-specific encoders, fusion functions, and fusion networks,\nsystematically identifying the architecture that best meets the task's\nperformance metrics.\n","authors":["Abdelmadjid Chergui","Grigor Bezirganyan","Sana Sellami","Laure Berti-Équille","Sébastien Fournier"],"pdf_url":"https://arxiv.org/pdf/2412.18437v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2409.19926v2","updated":"2024-12-24T13:52:34Z","published":"2024-09-30T04:02:52Z","title":"Data-driven decision-making under uncertainty with entropic risk measure","summary":" The entropic risk measure is widely used in high-stakes decision making to\naccount for tail risks associated with an uncertain loss. With limited data,\nthe empirical entropic risk estimator, i.e. replacing the expectation in the\nentropic risk measure with a sample average, underestimates the true risk. 
To\ndebias the empirical entropic risk estimator, we propose a strongly\nasymptotically consistent bootstrapping procedure. The first step of the\nprocedure involves fitting a distribution to the data, whereas the second step\nestimates the bias of the empirical entropic risk estimator using\nbootstrapping, and corrects for it. We show that naively fitting a Gaussian\nMixture Model to the data using the maximum likelihood criterion typically\nleads to an underestimation of the risk. To mitigate this issue, we consider\ntwo alternative methods: a more computationally demanding one that fits the\ndistribution of empirical entropic risk, and a simpler one that fits the\nextreme value distribution. As an application of the approach, we study a\ndistributionally robust entropic risk minimization problem with type-$\\infty$\nWasserstein ambiguity set, where debiasing the validation performance using our\ntechniques significantly improves the calibration of the size of the ambiguity\nset. Furthermore, we propose a distributionally robust optimization model for a\nwell-studied insurance contract design problem. The model considers multiple\n(potential) policyholders that have dependent risks and the insurer and\npolicyholders use entropic risk measure. We show that cross validation methods\ncan result in significantly higher out-of-sample risk for the insurer if the\nbias in validation performance is not corrected for. 
This improvement can be\nexplained from the observation that our methods suggest a higher (and more\naccurate) premium to homeowners.\n","authors":["Utsav Sadana","Erick Delage","Angelos Georghiou"],"pdf_url":"https://arxiv.org/pdf/2409.19926v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2410.06746v3","updated":"2024-12-24T13:50:44Z","published":"2024-10-09T10:30:01Z","title":"Cluster-wise Graph Transformer with Dual-granularity Kernelized\n Attention","summary":" In the realm of graph learning, there is a category of methods that\nconceptualize graphs as hierarchical structures, utilizing node clustering to\ncapture broader structural information. While generally effective, these\nmethods often rely on a fixed graph coarsening routine, leading to overly\nhomogeneous cluster representations and loss of node-level information. In this\npaper, we envision the graph as a network of interconnected node sets without\ncompressing each cluster into a single embedding. To enable effective\ninformation transfer among these node sets, we propose the Node-to-Cluster\nAttention (N2C-Attn) mechanism. N2C-Attn incorporates techniques from Multiple\nKernel Learning into the kernelized attention framework, effectively capturing\ninformation at both node and cluster levels. We then devise an efficient form\nfor N2C-Attn using the cluster-wise message-passing framework, achieving linear\ntime complexity. We further analyze how N2C-Attn combines bi-level feature maps\nof queries and keys, demonstrating its capability to merge dual-granularity\ninformation. The resulting architecture, Cluster-wise Graph Transformer\n(Cluster-GT), which uses node clusters as tokens and employs our proposed\nN2C-Attn module, shows superior performance on various graph-level tasks. 
Code\nis available at https://github.com/LUMIA-Group/Cluster-wise-Graph-Transformer.\n","authors":["Siyuan Huang","Yunchong Song","Jiayue Zhou","Zhouhan Lin"],"pdf_url":"https://arxiv.org/pdf/2410.06746v3.pdf","comment":"Accepted as NeurIPS 2024 Spotlight"},{"id":"http://arxiv.org/abs/2412.18432v1","updated":"2024-12-24T13:49:02Z","published":"2024-12-24T13:49:02Z","title":"Gaussian entropic optimal transport: Schrödinger bridges and the\n Sinkhorn algorithm","summary":" Entropic optimal transport problems are regularized versions of optimal\ntransport problems. These models play an increasingly important role in machine\nlearning and generative modelling. For finite spaces, these problems are\ncommonly solved using Sinkhorn algorithm (a.k.a. iterative proportional fitting\nprocedure). However, in more general settings the Sinkhorn iterations are based\non nonlinear conditional/conjugate transformations and exact finite-dimensional\nsolutions cannot be computed. This article presents a finite-dimensional\nrecursive formulation of the iterative proportional fitting procedure for\ngeneral Gaussian multivariate models. As expected, this recursive formulation\nis closely related to the celebrated Kalman filter and related Riccati matrix\ndifference equations, and it yields algorithms that can be implemented in\npractical settings without further approximations. We extend this filtering\nmethodology to develop a refined and self-contained convergence analysis of\nGaussian Sinkhorn algorithms, including closed form expressions of entropic\ntransport maps and Schr\\\"odinger bridges.\n","authors":["O. 
Deniz Akyildiz","Pierre Del Moral","Joaquín Miguez"],"pdf_url":"https://arxiv.org/pdf/2412.18432v1.pdf","comment":"68 pages"},{"id":"http://arxiv.org/abs/2412.07514v2","updated":"2024-12-24T13:40:21Z","published":"2024-12-10T13:51:48Z","title":"Physics-Based Dynamic Models Hybridisation Using Physics-Informed Neural\n Networks","summary":" Physics-based dynamic models (PBDMs) are simplified representations of\ncomplex dynamical systems. PBDMs take specific processes within a complex\nsystem and assign a fragment of variables and an accompanying set of parameters\nto depict the processes. As this often leads to suboptimal parameterisation of\nthe system, a key challenge requires refining the empirical parameters and\nvariables to reduce uncertainties while maintaining the model's explainability\nand enhancing its predictive accuracy. We demonstrate that a hybrid mosquito\npopulation dynamics model, which integrates a PBDM with Physics-Informed Neural\nNetworks (PINN), retains the explainability of the PBDM by incorporating the\nPINN-learned model parameters in place of its empirical counterparts.\nSpecifically, we address the limitations of traditional PBDMs by modelling the\nparameters of larva and pupa development rates using a PINN that encodes\ncomplex, learned interactions of air temperature, precipitation and humidity.\nOur results demonstrate improved mosquito population simulations including the\ndifficult-to-predict mosquito population peaks. 
This opens the possibility of\nhybridisation concept application on other complex systems based on PBDMs such\nas cancer growth to address the challenges posed by scarce and noisy data, and\nto numerical weather prediction and climate modelling to overcome the gap\nbetween physics-based and data-driven weather prediction models.\n","authors":["Branislava Lalic","Dinh Viet Cuong","Mina Petric","Vladimir Pavlovic","Ana Firanj Sremac","Mark Roantree"],"pdf_url":"https://arxiv.org/pdf/2412.07514v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2402.08309v3","updated":"2024-12-24T13:37:49Z","published":"2024-02-13T09:12:55Z","title":"Prompted Contextual Vectors for Spear-Phishing Detection","summary":" Spear-phishing attacks present a significant security challenge, with large\nlanguage models (LLMs) escalating the threat by generating convincing emails\nand facilitating target reconnaissance. To address this, we propose a detection\napproach based on a novel document vectorization method that utilizes an\nensemble of LLMs to create representation vectors. By prompting LLMs to reason\nand respond to human-crafted questions, we quantify the presence of common\npersuasion principles in the email's content, producing prompted contextual\ndocument vectors for a downstream supervised machine learning model. We\nevaluate our method using a unique dataset generated by a proprietary system\nthat automates target reconnaissance and spear-phishing email creation. Our\nmethod achieves a 91\\% F1 score in identifying LLM-generated spear-phishing\nemails, with the training set comprising only traditional phishing and benign\nemails. Key contributions include a novel document vectorization method\nutilizing LLM reasoning, a publicly available dataset of high-quality\nspear-phishing emails, and the demonstrated effectiveness of our method in\ndetecting such emails. 
This methodology can be utilized for various document\nclassification tasks, particularly in adversarial problem domains.\n","authors":["Daniel Nahmias","Gal Engelberg","Dan Klein","Asaf Shabtai"],"pdf_url":"https://arxiv.org/pdf/2402.08309v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2410.04739v2","updated":"2024-12-24T13:18:49Z","published":"2024-10-07T04:15:02Z","title":"TableRAG: Million-Token Table Understanding with Language Models","summary":" Recent advancements in language models (LMs) have notably enhanced their\nability to reason with tabular data, primarily through program-aided mechanisms\nthat manipulate and analyze tables. However, these methods often require the\nentire table as input, leading to scalability challenges due to the positional\nbias or context length constraints. In response to these challenges, we\nintroduce TableRAG, a Retrieval-Augmented Generation (RAG) framework\nspecifically designed for LM-based table understanding. TableRAG leverages\nquery expansion combined with schema and cell retrieval to pinpoint crucial\ninformation before providing it to the LMs. This enables more efficient data\nencoding and precise retrieval, significantly reducing prompt lengths and\nmitigating information loss. We have developed two new million-token benchmarks\nfrom the Arcade and BIRD-SQL datasets to thoroughly evaluate TableRAG's\neffectiveness at scale. 
Our results demonstrate that TableRAG's retrieval\ndesign achieves the highest retrieval quality, leading to the new\nstate-of-the-art performance on large-scale table understanding.\n","authors":["Si-An Chen","Lesly Miculicich","Julian Martin Eisenschlos","Zifeng Wang","Zilong Wang","Yanfei Chen","Yasuhisa Fujii","Hsuan-Tien Lin","Chen-Yu Lee","Tomas Pfister"],"pdf_url":"https://arxiv.org/pdf/2410.04739v2.pdf","comment":"Accepted to NeurIPS 2024"},{"id":"http://arxiv.org/abs/2412.16187v2","updated":"2024-12-24T13:04:45Z","published":"2024-12-13T06:00:27Z","title":"HashEvict: A Pre-Attention KV Cache Eviction Strategy using\n Locality-Sensitive Hashing","summary":" Transformer-based large language models (LLMs) use the key-value (KV) cache\nto significantly accelerate inference by storing the key and value embeddings\nof past tokens. However, this cache consumes significant GPU memory. In this\nwork, we introduce HashEvict, an algorithm that uses locality-sensitive hashing\n(LSH) to compress the KV cache. HashEvict quickly locates tokens in the cache\nthat are cosine dissimilar to the current query token. This is achieved by\ncomputing the Hamming distance between binarized Gaussian projections of the\ncurrent token query and cached token keys, with a projection length much\nsmaller than the embedding dimension. We maintain a lightweight binary\nstructure in GPU memory to facilitate these calculations. Unlike existing\ncompression strategies that compute attention to determine token retention,\nHashEvict makes these decisions pre-attention, thereby reducing computational\ncosts. Additionally, HashEvict is dynamic - at every decoding step, the key and\nvalue of the current token replace the embeddings of a token expected to\nproduce the lowest attention score. 
We demonstrate that HashEvict can compress\nthe KV cache by 30%-70% while maintaining high performance across reasoning,\nmultiple-choice, long-context retrieval and summarization tasks.\n","authors":["Minghui Liu","Tahseen Rabbani","Tony O'Halloran","Ananth Sankaralingam","Mary-Anne Hartley","Brian Gravelle","Furong Huang","Cornelia Fermüller","Yiannis Aloimonos"],"pdf_url":"https://arxiv.org/pdf/2412.16187v2.pdf","comment":"10 pages, 6 figures, 2 tables"},{"id":"http://arxiv.org/abs/2412.18414v1","updated":"2024-12-24T13:03:33Z","published":"2024-12-24T13:03:33Z","title":"Discovery of 2D Materials via Symmetry-Constrained Diffusion Model","summary":" Generative model for 2D materials has shown significant promise in\naccelerating the material discovery process. The stability and performance of\nthese materials are strongly influenced by their underlying symmetry. However,\nexisting generative models for 2D materials often neglect symmetry constraints,\nwhich limits both the diversity and quality of the generated structures. Here,\nwe introduce a symmetry-constrained diffusion model (SCDM) that integrates\nspace group symmetry into the generative process. By incorporating Wyckoff\npositions, the model ensures adherence to symmetry principles, leading to the\ngeneration of 2,000 candidate structures. DFT calculations were conducted to\nevaluate the convex hull energies of these structures after structural\nrelaxation. From the generated samples, 843 materials that met the energy\nstability criteria (Ehull < 0.6 eV/atom) were identified. Among these, six\ncandidates were selected for further stability analysis, including phonon band\nstructure evaluations and electronic properties investigations, all of which\nexhibited phonon spectrum stability. To benchmark the performance of SCDM, a\nsymmetry-unconstrained diffusion model was also evaluated via crystal structure\nprediction model. 
The results highlight that incorporating symmetry constraints\nenhances the effectiveness of generated 2D materials, making a contribution to\nthe discovery of 2D materials through generative modeling.\n","authors":["Shihang Xu","Shibing Chu","Rami Mrad","Zhejun Zhang","Zhelin Li","Runxian Jiao","Yuanping Chen"],"pdf_url":"https://arxiv.org/pdf/2412.18414v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.18407v1","updated":"2024-12-24T12:54:19Z","published":"2024-12-24T12:54:19Z","title":"A Statistical Framework for Ranking LLM-Based Chatbots","summary":" Large language models (LLMs) have transformed natural language processing,\nwith frameworks like Chatbot Arena providing pioneering platforms for\nevaluating these models. By facilitating millions of pairwise comparisons based\non human judgments, Chatbot Arena has become a cornerstone in LLM evaluation,\noffering rich datasets for ranking models in open-ended conversational tasks.\nBuilding upon this foundation, we propose a statistical framework that\nincorporates key advancements to address specific challenges in pairwise\ncomparison analysis. First, we introduce a factored tie model that enhances the\nability to handle ties -- an integral aspect of human-judged comparisons --\nsignificantly improving the model's fit to observed data. Second, we extend the\nframework to model covariance between competitors, enabling deeper insights\ninto performance relationships and facilitating intuitive groupings into\nperformance tiers. Third, we resolve optimization challenges arising from\nparameter non-uniqueness by introducing novel constraints, ensuring stable and\ninterpretable parameter estimation. Through rigorous evaluation and extensive\nexperimentation, our framework demonstrates substantial improvements over\nexisting methods in modeling pairwise comparison data. 
To support\nreproducibility and practical adoption, we release leaderbot, an open-source\nPython package implementing our models and analyses.\n","authors":["Siavash Ameli","Siyuan Zhuang","Ion Stoica","Michael W. Mahoney"],"pdf_url":"https://arxiv.org/pdf/2412.18407v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.18404v1","updated":"2024-12-24T12:51:05Z","published":"2024-12-24T12:51:05Z","title":"Extract Free Dense Misalignment from CLIP","summary":" Recent vision-language foundation models still frequently produce outputs\nmisaligned with their inputs, evidenced by object hallucination in captioning\nand prompt misalignment in the text-to-image generation model. Recent studies\nhave explored methods for identifying misaligned elements, aiming not only to\nenhance interpretability but also to improve model performance. However,\ncurrent approaches primarily rely on large foundation models in a zero-shot\nmanner or fine-tuned models with human annotations, which limits scalability\ndue to significant computational costs. This work proposes a novel approach,\ndubbed CLIP4DM, for detecting dense misalignments from pre-trained CLIP,\nspecifically focusing on pinpointing misaligned words between image and text.\nWe carefully revamp the gradient-based attribution computation method, enabling\nnegative gradient of individual text tokens to indicate misalignment. We also\npropose F-CLIPScore, which aggregates misaligned attributions with a global\nalignment score. We evaluate our method on various dense misalignment detection\nbenchmarks, covering various image and text domains and misalignment types. Our\nmethod demonstrates state-of-the-art performance among zero-shot models and\ncompetitive performance with fine-tuned models while maintaining superior\nefficiency. Our qualitative examples show that our method has a unique strength\nto detect entity-level objects, intangible objects, and attributes that can not\nbe easily detected for existing works. 
We conduct ablation studies and analyses\nto highlight the strengths and limitations of our approach. Our code is\npublicly available at https://github.com/naver-ai/CLIP4DM.\n","authors":["JeongYeon Nam","Jinbae Im","Wonjae Kim","Taeho Kil"],"pdf_url":"https://arxiv.org/pdf/2412.18404v1.pdf","comment":"16 pages, 14 figures, AAAI 2025"},{"id":"http://arxiv.org/abs/2311.17303v3","updated":"2024-12-24T12:31:05Z","published":"2023-11-29T01:25:00Z","title":"Enhancing the Performance of Neural Networks Through Causal Discovery\n and Integration of Domain Knowledge","summary":" In this paper, we develop a generic methodology to encode hierarchical\ncausality structure among observed variables into a neural network in order to\nimprove its predictive performance. The proposed methodology, called\ncausality-informed neural network (CINN), leverages three coherent steps to\nsystematically map the structural causal knowledge into the layer-to-layer\ndesign of neural network while strictly preserving the orientation of every\ncausal relationship. In the first step, CINN discovers causal relationships\nfrom observational data via directed acyclic graph (DAG) learning, where causal\ndiscovery is recast as a continuous optimization problem to avoid the\ncombinatorial nature. In the second step, the discovered hierarchical causality\nstructure among observed variables is systematically encoded into neural\nnetwork through a dedicated architecture and customized loss function. By\ncategorizing variables in the causal DAG as root, intermediate, and leaf nodes,\nthe hierarchical causal DAG is translated into CINN with a one-to-one\ncorrespondence between nodes in the causal DAG and units in the CINN while\nmaintaining the relative order among these nodes. Regarding the loss function,\nboth intermediate and leaf nodes in the DAG graph are treated as target outputs\nduring CINN training so as to drive co-learning of causal relationships among\ndifferent types of nodes. 
As multiple loss components emerge in CINN, we\nleverage the projection of conflicting gradients to mitigate gradient\ninterference among the multiple learning tasks. Computational experiments\nacross a broad spectrum of UCI data sets demonstrate substantial advantages of\nCINN in predictive performance over other state-of-the-art methods. In\naddition, an ablation study underscores the value of integrating structural and\nquantitative causal knowledge in enhancing the neural network's predictive\nperformance incrementally.\n","authors":["Xiaoge Zhang","Xiao-Lin Wang","Fenglei Fan","Yiu-Ming Cheung","Indranil Bose"],"pdf_url":"https://arxiv.org/pdf/2311.17303v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.18390v1","updated":"2024-12-24T12:28:19Z","published":"2024-12-24T12:28:19Z","title":"RDPM: Solve Diffusion Probabilistic Models via Recurrent Token\n Prediction","summary":" Diffusion Probabilistic Models (DPMs) have emerged as the de facto approach\nfor high-fidelity image synthesis, operating diffusion processes on continuous\nVAE latent, which significantly differ from the text generation methods\nemployed by Large Language Models (LLMs). In this paper, we introduce a novel\ngenerative framework, the Recurrent Diffusion Probabilistic Model (RDPM), which\nenhances the diffusion process through a recurrent token prediction mechanism,\nthereby pioneering the field of Discrete Diffusion. By progressively\nintroducing Gaussian noise into the latent representations of images and\nencoding them into vector-quantized tokens in a recurrent manner, RDPM\nfacilitates a unique diffusion process on discrete-value domains. This process\niteratively predicts the token codes for subsequent timesteps, transforming the\ninitial standard Gaussian noise into the source data distribution, aligning\nwith GPT-style models in terms of the loss function. RDPM demonstrates superior\nperformance while benefiting from the speed advantage of requiring only a few\ninference steps. 
This model not only leverages the diffusion process to ensure\nhigh-quality generation but also converts continuous signals into a series of\nhigh-fidelity discrete tokens, thereby maintaining a unified optimization\nstrategy with other discrete tokens, such as text. We anticipate that this work\nwill contribute to the development of a unified model for multimodal\ngeneration, specifically by integrating continuous signal domains such as\nimages, videos, and audio with text. We will release the code and model weights\nto the open-source community.\n","authors":["Wu Xiaoping","Hu Jie","Wei Xiaoming"],"pdf_url":"https://arxiv.org/pdf/2412.18390v1.pdf","comment":"8 pages"},{"id":"http://arxiv.org/abs/2412.18387v1","updated":"2024-12-24T12:20:24Z","published":"2024-12-24T12:20:24Z","title":"Weak Scaling Capability in Token Space: An Observation from Large Vision\n Language Model","summary":" The scaling capability has been widely validated with respect to the number\nof parameters and the size of training data. One important question that is\nunexplored is that does scaling capability also exists similarly with respect\nto the number of vision tokens? This study fills the gap by investigating the\nrelationship between the number of vision tokens and the performance of\nvision-language models. Our theoretical analysis and empirical evaluations\nreveal that the model exhibits weak scaling capabilities on the length \\(N_l\\),\nwith performance approximately \\(S(N_l) \\approx (c/N_l)^{\\alpha}\\), where \\(c,\n\\alpha\\) are hyperparameters. Interestingly, this scaling behavior remains\nlargely unaffected by the inclusion or exclusion of the user's question in the\ninput. Furthermore, fusing the user's question with the vision token can\nenhance model performance when the question is relevant to the task. 
To address\nthe computational challenges associated with large-scale vision tokens, we\npropose a novel architecture that efficiently reduces the token count while\nintegrating user question tokens into the representation. Our findings may\noffer insights for developing more efficient and effective vision-language\nmodels under specific task constraints.\n","authors":["Tenghui Li","Guoxu Zhou","Xuyang Zhao","Qibin Zhao"],"pdf_url":"https://arxiv.org/pdf/2412.18387v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.18377v1","updated":"2024-12-24T12:03:36Z","published":"2024-12-24T12:03:36Z","title":"ChaI-TeA: A Benchmark for Evaluating Autocompletion of Interactions with\n LLM-based Chatbots","summary":" The rise of LLMs has deflected a growing portion of human-computer\ninteractions towards LLM-based chatbots. The remarkable abilities of these\nmodels allow users to interact using long, diverse natural language text\ncovering a wide range of topics and styles. Phrasing these messages is a time\nand effort consuming task, calling for an autocomplete solution to assist\nusers. We introduce the task of chatbot interaction autocomplete. We present\nChaI-TeA: CHat InTEraction Autocomplete; An autcomplete evaluation framework\nfor LLM-based chatbot interactions. The framework includes a formal definition\nof the task, coupled with suitable datasets and metrics. We use the framework\nto evaluate After formally defining the task along with suitable datasets and\nmetrics, we test 9 models on the defined auto completion task, finding that\nwhile current off-the-shelf models perform fairly, there is still much room for\nimprovement, mainly in ranking of the generated suggestions. We provide\ninsights for practitioners working on this task and open new research\ndirections for researchers in the field. 
We release our framework to serve as a\nfoundation for future research.\n","authors":["Shani Goren","Oren Kalinsky","Tomer Stav","Yuri Rapoport","Yaron Fairstein","Ram Yazdy","Nachshon Cohen","Alexander Libov","Guy Kushilevitz"],"pdf_url":"https://arxiv.org/pdf/2412.18377v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.18370v1","updated":"2024-12-24T11:53:24Z","published":"2024-12-24T11:53:24Z","title":"Unveiling the Threat of Fraud Gangs to Graph Neural Networks:\n Multi-Target Graph Injection Attacks against GNN-Based Fraud Detectors","summary":" Graph neural networks (GNNs) have emerged as an effective tool for fraud\ndetection, identifying fraudulent users, and uncovering malicious behaviors.\nHowever, attacks against GNN-based fraud detectors and their risks have rarely\nbeen studied, thereby leaving potential threats unaddressed. Recent findings\nsuggest that frauds are increasingly organized as gangs or groups. In this\nwork, we design attack scenarios where fraud gangs aim to make their fraud\nnodes misclassified as benign by camouflaging their illicit activities in\ncollusion. Based on these scenarios, we study adversarial attacks against\nGNN-based fraud detectors by simulating attacks of fraud gangs in three\nreal-world fraud cases: spam reviews, fake news, and medical insurance frauds.\nWe define these attacks as multi-target graph injection attacks and propose\nMonTi, a transformer-based Multi-target one-Time graph injection attack model.\nMonTi simultaneously generates attributes and edges of all attack nodes with a\ntransformer encoder, capturing interdependencies between attributes and edges\nmore effectively than most existing graph injection attack methods that\ngenerate these elements sequentially. 
Additionally, MonTi adaptively allocates\nthe degree budget for each attack node to explore diverse injection structures\ninvolving target, candidate, and attack nodes, unlike existing methods that fix\nthe degree budget across all attack nodes. Experiments show that MonTi\noutperforms the state-of-the-art graph injection attack methods on five\nreal-world graphs.\n","authors":["Jinhyeok Choi","Heehyeon Kim","Joyce Jiyoung Whang"],"pdf_url":"https://arxiv.org/pdf/2412.18370v1.pdf","comment":"19 pages, 5 figures, 12 tables, The 39th AAAI Conference on\n Artificial Intelligence (AAAI 2025)"},{"id":"http://arxiv.org/abs/2412.18365v1","updated":"2024-12-24T11:48:41Z","published":"2024-12-24T11:48:41Z","title":"Hypergraph Attacks via Injecting Homogeneous Nodes into Elite Hyperedges","summary":" Recent studies have shown that Hypergraph Neural Networks (HGNNs) are\nvulnerable to adversarial attacks. Existing approaches focus on hypergraph\nmodification attacks guided by gradients, overlooking node spanning in the\nhypergraph and the group identity of hyperedges, thereby resulting in limited\nattack performance and detectable attacks. In this manuscript, we present a\nnovel framework, i.e., Hypergraph Attacks via Injecting Homogeneous Nodes into\nElite Hyperedges (IE-Attack), to tackle these challenges. Initially, utilizing\nthe node spanning in the hypergraph, we propose the elite hyperedges sampler to\nidentify hyperedges to be injected. Subsequently, a node generator utilizing\nKernel Density Estimation (KDE) is proposed to generate the homogeneous node\nwith the group identity of hyperedges. Finally, by injecting the homogeneous\nnode into elite hyperedges, IE-Attack improves the attack performance and\nenhances the imperceptibility of attacks. 
Extensive experiments are conducted\non five authentic datasets to validate the effectiveness of IE-Attack and the\ncorresponding superiority to state-of-the-art methods.\n","authors":["Meixia He","Peican Zhu","Keke Tang","Yangming Guo"],"pdf_url":"https://arxiv.org/pdf/2412.18365v1.pdf","comment":"9 pages, The 39th Annual AAAI Conference on Artificial\n Intelligence(2025)"},{"id":"http://arxiv.org/abs/2312.13185v2","updated":"2024-12-24T11:47:48Z","published":"2023-12-20T16:54:05Z","title":"Measurement-based quantum computation from Clifford quantum cellular\n automata","summary":" Measurement-based quantum computation (MBQC) is a paradigm for quantum\ncomputation where computation is driven by local measurements on a suitably\nentangled resource state. In this work we show that MBQC is related to a model\nof quantum computation based on Clifford quantum cellular automata (CQCA).\nSpecifically, we show that certain MBQCs can be directly constructed from CQCAs\nwhich yields a simple and intuitive circuit model representation of MBQC in\nterms of quantum computation based on CQCA. We apply this description to\nconstruct various MBQC-based Ans\\\"atze for parameterized quantum circuits,\ndemonstrating that the different Ans\\\"atze may lead to significantly different\nperformances on different learning tasks. In this way, MBQC yields a family of\nHardware-efficient Ans\\\"atze that may be adapted to specific problem settings\nand is particularly well suited for architectures with translationally\ninvariant gates such as neutral atoms.\n","authors":["Hendrik Poulsen Nautrup","Hans J. 
Briegel"],"pdf_url":"https://arxiv.org/pdf/2312.13185v2.pdf","comment":"16 pages, 12 figures"},{"id":"http://arxiv.org/abs/2405.16771v2","updated":"2024-12-24T11:46:23Z","published":"2024-05-27T02:42:33Z","title":"ARC: A Generalist Graph Anomaly Detector with In-Context Learning","summary":" Graph anomaly detection (GAD), which aims to identify abnormal nodes that\ndiffer from the majority within a graph, has garnered significant attention.\nHowever, current GAD methods necessitate training specific to each dataset,\nresulting in high training costs, substantial data requirements, and limited\ngeneralizability when being applied to new datasets and domains. To address\nthese limitations, this paper proposes ARC, a generalist GAD approach that\nenables a ``one-for-all'' GAD model to detect anomalies across various graph\ndatasets on-the-fly. Equipped with in-context learning, ARC can directly\nextract dataset-specific patterns from the target dataset using few-shot normal\nsamples at the inference stage, without the need for retraining or fine-tuning\non the target dataset. ARC comprises three components that are well-crafted for\ncapturing universal graph anomaly patterns: 1) smoothness-based feature\nAlignment module that unifies the features of different datasets into a common\nand anomaly-sensitive space; 2) ego-neighbor Residual graph encoder that learns\nabnormality-related node embeddings; and 3) cross-attentive in-Context anomaly\nscoring module that predicts node abnormality by leveraging few-shot normal\nsamples. 
Extensive experiments on multiple benchmark datasets from various\ndomains demonstrate the superior anomaly detection performance, efficiency, and\ngeneralizability of ARC.\n","authors":["Yixin Liu","Shiyuan Li","Yu Zheng","Qingfeng Chen","Chengqi Zhang","Shirui Pan"],"pdf_url":"https://arxiv.org/pdf/2405.16771v2.pdf","comment":"25 pages, 10 figures"},{"id":"http://arxiv.org/abs/2412.18362v1","updated":"2024-12-24T11:44:58Z","published":"2024-12-24T11:44:58Z","title":"Point-DeepONet: A Deep Operator Network Integrating PointNet for\n Nonlinear Analysis of Non-Parametric 3D Geometries and Load Conditions","summary":" Nonlinear structural analyses in engineering often require extensive finite\nelement simulations, limiting their applicability in design optimization,\nuncertainty quantification, and real-time control. Conventional deep learning\nsurrogates, such as convolutional neural networks (CNNs), physics-informed\nneural networks (PINNs), and fourier neural operators (FNOs), face challenges\nwith complex non-parametric three-dimensional (3D) geometries, directionally\nvarying loads, and high-fidelity predictions on unstructured meshes. This work\npresents Point-DeepONet, an operator-learning-based surrogate that integrates\nPointNet into the DeepONet framework. By directly processing non-parametric\npoint clouds and incorporating signed distance functions (SDF) for geometric\ncontext, Point-DeepONet accurately predicts three-dimensional displacement and\nvon Mises stress fields without mesh parameterization or retraining. Trained\nusing only about 5,000 nodes (2.5% of the original 200,000-node mesh),\nPoint-DeepONet can still predict the entire mesh at high fidelity, achieving a\ncoefficient of determination reaching 0.987 for displacement and 0.923 for von\nMises stress under a horizontal load case. 
Compared to nonlinear finite element\nanalyses that require about 19.32 minutes per case, Point-DeepONet provides\npredictions in mere seconds-approximately 400 times faster-while maintaining\nexcellent scalability and accuracy with increasing dataset sizes. These\nfindings highlight the potential of Point-DeepONet to enable rapid,\nhigh-fidelity structural analyses, ultimately supporting more effective design\nexploration and informed decision-making in complex engineering workflows.\n","authors":["Jangseop Park","Namwoo Kang"],"pdf_url":"https://arxiv.org/pdf/2412.18362v1.pdf","comment":"23 pages, 16 figures, and 5 tables"},{"id":"http://arxiv.org/abs/2412.18355v1","updated":"2024-12-24T11:35:40Z","published":"2024-12-24T11:35:40Z","title":"Addressing Spatial-Temporal Data Heterogeneity in Federated Continual\n Learning via Tail Anchor","summary":" Federated continual learning (FCL) allows each client to continually update\nits knowledge from task streams, enhancing the applicability of federated\nlearning in real-world scenarios. However, FCL needs to address not only\nspatial data heterogeneity between clients but also temporal data heterogeneity\nbetween tasks. In this paper, empirical experiments demonstrate that such\ninput-level heterogeneity significantly affects the model's internal parameters\nand outputs, leading to severe spatial-temporal catastrophic forgetting of\nlocal and previous knowledge. To this end, we propose Federated Tail Anchor\n(FedTA) to mix trainable Tail Anchor with the frozen output features to adjust\ntheir position in the feature space, thereby overcoming parameter-forgetting\nand output-forgetting. 
Moreover, three novel components are also included in\nFedTA: Input Enhancement for improving the performance of pre-trained models on\ndownstream tasks; Selective Input Knowledge Fusion for fusion of heterogeneous\nlocal knowledge on the server side; and Best Global Prototype Selection for\nfinding the best anchor point for each class in the feature space. Extensive\nexperiments demonstrate that FedTA not only outperforms existing FCL methods\nbut also effectively preserves the relative positions of features, remaining\nunaffected by spatial and temporal changes.\n","authors":["Hao Yu","Xin Yang","Le Zhang","Hanlin Gu","Tianrui Li","Lixin Fan","Qiang Yang"],"pdf_url":"https://arxiv.org/pdf/2412.18355v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.18975v2","updated":"2024-12-24T11:28:41Z","published":"2024-05-29T10:38:25Z","title":"Hierarchical Classification Auxiliary Network for Time Series\n Forecasting","summary":" Deep learning has significantly advanced time series forecasting through its\npowerful capacity to capture sequence relationships. However, training these\nmodels with the Mean Square Error (MSE) loss often results in over-smooth\npredictions, making it challenging to handle the complexity and learn\nhigh-entropy features from time series data with high variability and\nunpredictability. In this work, we introduce a novel approach by tokenizing\ntime series values to train forecasting models via cross-entropy loss, while\nconsidering the continuous nature of time series data. Specifically, we propose\na Hierarchical Classification Auxiliary Network, HCAN, a general model-agnostic\ncomponent that can be integrated with any forecasting model. HCAN is based on a\nHierarchy-Aware Attention module that integrates multi-granularity high-entropy\nfeatures at different hierarchy levels. At each level, we assign a class label\nfor timesteps to train an Uncertainty-Aware Classifier. 
This classifier\nmitigates the over-confidence in softmax loss via evidence theory. We also\nimplement a Hierarchical Consistency Loss to maintain prediction consistency\nacross hierarchy levels. Extensive experiments integrating HCAN with\nstate-of-the-art forecasting models demonstrate substantial improvements over\nbaselines on several real-world datasets.\n","authors":["Yanru Sun","Zongxia Xie","Dongyue Chen","Emadeldeen Eldele","Qinghua Hu"],"pdf_url":"https://arxiv.org/pdf/2405.18975v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.18344v1","updated":"2024-12-24T11:02:38Z","published":"2024-12-24T11:02:38Z","title":"Predator Prey Scavenger Model using Holling's Functional Response of\n Type III and Physics-Informed Deep Neural Networks","summary":" Nonlinear mathematical models introduce the relation between various physical\nand biological interactions present in nature. One of the most famous models is\nthe Lotka-Volterra model which defined the interaction between predator and\nprey species present in nature. However, predators, scavengers, and prey\npopulations coexist in a natural system where scavengers can additionally rely\non the dead bodies of predators present in the system. Keeping this in mind,\nthe formulation and simulation of the predator prey scavenger model is\nintroduced in this paper. For the predation response, respective prey species\nare assumed to have Holling's functional response of type III. The proposed\nmodel is tested for various simulations and is found to be showing satisfactory\nresults in different scenarios. After simulations, the American forest dataset\nis taken for parameter estimation which imitates the real-world case. For\nparameter estimation, a physics-informed deep neural network is used with the\nAdam backpropagation method which prevents the avalanche effect in trainable\nparameters updation. For neural networks, mean square error and\nphysics-informed informed error are considered. 
After the neural network, the\nhence-found parameters are fine-tuned using the\nBroyden-Fletcher-Goldfarb-Shanno algorithm. Finally, the hence-found parameters\nusing a natural dataset are tested for stability using Jacobian stability\nanalysis. Future research work includes minimization of error induced by\nparameters, bifurcation analysis, and sensitivity analysis of the parameters.\n","authors":["Aneesh Panchal","Kirti Beniwal","Vivek Kumar"],"pdf_url":"https://arxiv.org/pdf/2412.18344v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.18342v1","updated":"2024-12-24T11:00:23Z","published":"2024-12-24T11:00:23Z","title":"Mitigating Label Noise using Prompt-Based Hyperbolic Meta-Learning in\n Open-Set Domain Generalization","summary":" Open-Set Domain Generalization (OSDG) is a challenging task requiring models\nto accurately predict familiar categories while minimizing confidence for\nunknown categories to effectively reject them in unseen domains. While the OSDG\nfield has seen considerable advancements, the impact of label noise--a common\nissue in real-world datasets--has been largely overlooked. Label noise can\nmislead model optimization, thereby exacerbating the challenges of open-set\nrecognition in novel domains. In this study, we take the first step towards\naddressing Open-Set Domain Generalization under Noisy Labels (OSDG-NL) by\nconstructing dedicated benchmarks derived from widely used OSDG datasets,\nincluding PACS and DigitsDG. We evaluate baseline approaches by integrating\ntechniques from both label denoising and OSDG methodologies, highlighting the\nlimitations of existing strategies in handling label noise effectively. To\naddress these limitations, we propose HyProMeta, a novel framework that\nintegrates hyperbolic category prototypes for label noise-aware meta-learning\nalongside a learnable new-category agnostic prompt designed to enhance\ngeneralization to unseen classes. 
Our extensive experiments demonstrate the\nsuperior performance of HyProMeta compared to state-of-the-art methods across\nthe newly established benchmarks. The source code of this work is released at\nhttps://github.com/KPeng9510/HyProMeta.\n","authors":["Kunyu Peng","Di Wen","Sarfraz M. Saquib","Yufan Chen","Junwei Zheng","David Schneider","Kailun Yang","Jiamin Wu","Alina Roitberg","Rainer Stiefelhagen"],"pdf_url":"https://arxiv.org/pdf/2412.18342v1.pdf","comment":"The source code of this work is released at\n https://github.com/KPeng9510/HyProMeta"},{"id":"http://arxiv.org/abs/2411.15364v2","updated":"2024-12-24T10:57:49Z","published":"2024-11-22T22:13:40Z","title":"Exploring Facets of Language Generation in the Limit","summary":" The recent work of Kleinberg & Mullainathan [KM24] provides a concrete model\nfor language generation in the limit: given a sequence of examples from an\nunknown target language, the goal is to generate new examples from the target\nlanguage such that no incorrect examples are generated beyond some point. In\nsharp contrast to strong negative results for the closely related problem of\nlanguage identification, they establish positive results for language\ngeneration in the limit for all countable collections of languages. Follow-up\nwork by Raman & Tewari [RT24] studies bounds on the number of distinct inputs\nrequired by an algorithm before correct language generation is achieved --\nnamely, whether this is a constant for all languages in the collection (uniform\ngeneration) or a language-dependent constant (non-uniform generation).\n We show that every countable language collection has a generator which has\nthe stronger property of non-uniform generation in the limit. 
However, while\nthe generation algorithm of [KM24] can be implemented using membership queries,\nwe show that any algorithm cannot non-uniformly generate even for collections\nof just two languages, using only membership queries.\n We also formalize the tension between validity and breadth in the generation\nalgorithm of [KM24] by introducing a definition of exhaustive generation, and\nshow a strong negative result for exhaustive generation. Our result shows that\na tradeoff between validity and breadth is inherent for generation in the\nlimit. We also provide a precise characterization of the language collections\nfor which exhaustive generation is possible. Finally, inspired by algorithms\nthat can choose to obtain feedback, we consider a model of uniform generation\nwith feedback, completely characterizing language collections for which such\nuniform generation with feedback is possible in terms of a complexity measure\nof the collection.\n","authors":["Moses Charikar","Chirag Pabbaraju"],"pdf_url":"https://arxiv.org/pdf/2411.15364v2.pdf","comment":"31 pages. Fixed typos, updated related work, added results on\n characterization of exhaustive generation"},{"id":"http://arxiv.org/abs/2205.15128v4","updated":"2024-12-24T10:48:30Z","published":"2022-05-30T14:21:16Z","title":"Level Up with ML Vulnerability Identification: Leveraging Domain\n Constraints in Feature Space for Robust Android Malware Detection","summary":" Machine Learning (ML) promises to enhance the efficacy of Android Malware\nDetection (AMD); however, ML models are vulnerable to realistic evasion\nattacks--crafting realizable Adversarial Examples (AEs) that satisfy Android\nmalware domain constraints. To eliminate ML vulnerabilities, defenders aim to\nidentify susceptible regions in the feature space where ML models are prone to\ndeception. The primary approach to identifying vulnerable regions involves\ninvestigating realizable AEs, but generating these feasible apps poses a\nchallenge. 
For instance, previous work has relied on generating either\nfeature-space norm-bounded AEs or problem-space realizable AEs in adversarial\nhardening. The former is efficient but lacks full coverage of vulnerable\nregions while the latter can uncover these regions by satisfying domain\nconstraints but is known to be time-consuming. To address these limitations, we\npropose an approach to facilitate the identification of vulnerable regions.\nSpecifically, we introduce a new interpretation of Android domain constraints\nin the feature space, followed by a novel technique that learns them. Our\nempirical evaluations across various evasion attacks indicate effective\ndetection of AEs using learned domain constraints, with an average of 89.6%.\nFurthermore, extensive experiments on different Android malware detectors\ndemonstrate that utilizing our learned domain constraints in Adversarial\nTraining (AT) outperforms other AT-based defenses that rely on norm-bounded AEs\nor state-of-the-art non-uniform perturbations. Finally, we show that retraining\na malware detector with a wide variety of feature-space realizable AEs results\nin a 77.9% robustness improvement against realizable AEs generated by unknown\nproblem-space transformations, with up to 70x faster training than using\nproblem-space realizable AEs.\n","authors":["Hamid Bostani","Zhengyu Zhao","Zhuoran Liu","Veelasha Moonsamy"],"pdf_url":"https://arxiv.org/pdf/2205.15128v4.pdf","comment":"The paper was accepted by ACM Transactions on Privacy and Security on\n 2 December 2024"},{"id":"http://arxiv.org/abs/2408.02698v2","updated":"2024-12-24T10:45:47Z","published":"2024-08-04T15:01:52Z","title":"Applications of Scientific Machine Learning for the Analysis of\n Functionally Graded Porous Beams","summary":" This study investigates different Scientific Machine Learning (SciML)\napproaches for the analysis of functionally graded (FG) porous beams and\ncompares them under a new framework. 
The beam material properties are assumed\nto vary as an arbitrary continuous function. The methods consider the output of\na neural network/operator as an approximation to the displacement fields and\nderive the equations governing beam behavior based on the continuum\nformulation. The methods are implemented in the framework and formulated by\nthree approaches: (a) the vector approach leads to a Physics-Informed Neural\nNetwork (PINN), (b) the energy approach brings about the Deep Energy Method\n(DEM), and (c) the data-driven approach, which results in a class of Neural\nOperator methods. Finally, a neural operator has been trained to predict the\nresponse of the porous beam with functionally graded material under any\nporosity distribution pattern and any arbitrary traction condition. The results\nare validated with analytical and numerical reference solutions. The data and\ncode accompanying this manuscript will be publicly available at\nhttps://github.com/eshaghi-ms/DeepNetBeam.\n","authors":["Mohammad Sadegh Eshaghi","Mostafa Bamdad","Cosmin Anitescu","Yizheng Wang","Xiaoying Zhuang","Timon Rabczuk"],"pdf_url":"https://arxiv.org/pdf/2408.02698v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2406.14095v2","updated":"2024-12-24T10:44:10Z","published":"2024-06-20T08:21:52Z","title":"Memory-Efficient Gradient Unrolling for Large-Scale Bi-level\n Optimization","summary":" Bi-level optimization (BO) has become a fundamental mathematical framework\nfor addressing hierarchical machine learning problems. As deep learning models\ncontinue to grow in size, the demand for scalable bi-level optimization\nsolutions has become increasingly critical. Traditional gradient-based bi-level\noptimization algorithms, due to their inherent characteristics, are ill-suited\nto meet the demands of large-scale applications. 
In this paper, we introduce\n$\\textbf{F}$orward $\\textbf{G}$radient $\\textbf{U}$nrolling with\n$\\textbf{F}$orward $\\textbf{F}$radient, abbreviated as\n$(\\textbf{FG})^2\\textbf{U}$, which achieves an unbiased stochastic\napproximation of the meta gradient for bi-level optimization.\n$(\\text{FG})^2\\text{U}$ circumvents the memory and approximation issues\nassociated with classical bi-level optimization approaches, and delivers\nsignificantly more accurate gradient estimates than existing large-scale\nbi-level optimization approaches. Additionally, $(\\text{FG})^2\\text{U}$ is\ninherently designed to support parallel computing, enabling it to effectively\nleverage large-scale distributed computing systems to achieve significant\ncomputational efficiency. In practice, $(\\text{FG})^2\\text{U}$ and other\nmethods can be strategically placed at different stages of the training process\nto achieve a more cost-effective two-phase paradigm. Further,\n$(\\text{FG})^2\\text{U}$ is easy to implement within popular deep learning\nframeworks, and can be conveniently adapted to address more challenging\nzeroth-order bi-level optimization scenarios. We provide a thorough convergence\nanalysis and a comprehensive practical discussion for $(\\text{FG})^2\\text{U}$,\ncomplemented by extensive empirical evaluations, showcasing its superior\nperformance in diverse large-scale bi-level optimization tasks. Code is\navailable at https://github.com/ShenQianli/FG2U.\n","authors":["Qianli Shen","Yezhen Wang","Zhouhao Yang","Xiang Li","Haonan Wang","Yang Zhang","Jonathan Scarlett","Zhanxing Zhu","Kenji Kawaguchi"],"pdf_url":"https://arxiv.org/pdf/2406.14095v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2406.16540v3","updated":"2024-12-24T10:42:39Z","published":"2024-06-24T11:20:44Z","title":"Improving robustness to corruptions with multiplicative weight\n perturbations","summary":" Deep neural networks (DNNs) excel on clean images but struggle with corrupted\nones. 
Incorporating specific corruptions into the data augmentation pipeline\ncan improve robustness to those corruptions but may harm performance on clean\nimages and other types of distortion. In this paper, we introduce an\nalternative approach that improves the robustness of DNNs to a wide range of\ncorruptions without compromising accuracy on clean images. We first demonstrate\nthat input perturbations can be mimicked by multiplicative perturbations in the\nweight space. Leveraging this, we propose Data Augmentation via Multiplicative\nPerturbation (DAMP), a training method that optimizes DNNs under random\nmultiplicative weight perturbations. We also examine the recently proposed\nAdaptive Sharpness-Aware Minimization (ASAM) and show that it optimizes DNNs\nunder adversarial multiplicative weight perturbations. Experiments on image\nclassification datasets (CIFAR-10/100, TinyImageNet and ImageNet) and neural\nnetwork architectures (ResNet50, ViT-S/16, ViT-B/16) show that DAMP enhances\nmodel generalization performance in the presence of corruptions across\ndifferent settings. Notably, DAMP is able to train a ViT-S/16 on ImageNet from\nscratch, reaching the top-1 error of 23.7% which is comparable to ResNet50\nwithout extensive data augmentations.\n","authors":["Trung Trinh","Markus Heinonen","Luigi Acerbi","Samuel Kaski"],"pdf_url":"https://arxiv.org/pdf/2406.16540v3.pdf","comment":"Published at NeurIPS 2024 (spotlight). Code is available at\n https://github.com/trungtrinh44/DAMP"},{"id":"http://arxiv.org/abs/2404.19165v2","updated":"2024-12-24T10:41:24Z","published":"2024-04-30T00:02:34Z","title":"DelGrad: Exact event-based gradients in spiking networks for training\n delays and weights","summary":" Spiking neural networks (SNNs) inherently rely on the timing of signals for\nrepresenting and processing information. Incorporating trainable transmission\ndelays, alongside synaptic weights, is crucial for shaping these temporal\ndynamics. 
While recent methods have shown the benefits of training delays and\nweights in terms of accuracy and memory efficiency, they rely on discrete time,\napproximate gradients, and full access to internal variables like membrane\npotentials. This limits their precision, efficiency, and suitability for\nneuromorphic hardware due to increased memory requirements and I/O bandwidth\ndemands. To address these challenges, we propose DelGrad, an analytical,\nevent-based method to compute exact loss gradients for both synaptic weights\nand delays. The inclusion of delays in the training process emerges naturally\nwithin our proposed formalism, enriching the model's search space with a\ntemporal dimension. Moreover, DelGrad, grounded purely in spike timing,\neliminates the need to track additional variables such as membrane potentials.\nTo showcase this key advantage, we demonstrate the functionality and benefits\nof DelGrad on the BrainScaleS-2 neuromorphic platform, by training SNNs in a\nchip-in-the-loop fashion. For the first time, we experimentally demonstrate the\nmemory efficiency and accuracy benefits of adding delays to SNNs on noisy\nmixed-signal hardware. Additionally, these experiments also reveal the\npotential of delays for stabilizing networks against noise. DelGrad opens a new\nway for training SNNs with delays on neuromorphic hardware, which results in\nless number of required parameters, higher accuracy and ease of hardware\ntraining.\n","authors":["Julian Göltz","Jimmy Weber","Laura Kriener","Sebastian Billaudelle","Peter Lake","Johannes Schemmel","Melika Payvand","Mihai A. 
Petrovici"],"pdf_url":"https://arxiv.org/pdf/2404.19165v2.pdf","comment":"22 pages, 11 figures"},{"id":"http://arxiv.org/abs/2412.18322v1","updated":"2024-12-24T10:17:14Z","published":"2024-12-24T10:17:14Z","title":"Exploring Graph Mamba: A Comprehensive Survey on State-Space Models for\n Graph Learning","summary":" Graph Mamba, a powerful graph embedding technique, has emerged as a\ncornerstone in various domains, including bioinformatics, social networks, and\nrecommendation systems. This survey represents the first comprehensive study\ndevoted to Graph Mamba, to address the critical gaps in understanding its\napplications, challenges, and future potential. We start by offering a detailed\nexplanation of the original Graph Mamba architecture, highlighting its key\ncomponents and underlying mechanisms. Subsequently, we explore the most recent\nmodifications and enhancements proposed to improve its performance and\napplicability. To demonstrate the versatility of Graph Mamba, we examine its\napplications across diverse domains. A comparative analysis of Graph Mamba and\nits variants is conducted to shed light on their unique characteristics and\npotential use cases. Furthermore, we identify potential areas where Graph Mamba\ncan be applied in the future, highlighting its potential to revolutionize data\nanalysis in these fields. Finally, we address the current limitations and open\nresearch questions associated with Graph Mamba. By acknowledging these\nchallenges, we aim to stimulate further research and development in this\npromising area. 
This survey serves as a valuable resource for both newcomers\nand experienced researchers seeking to understand and leverage the power of\nGraph Mamba.\n","authors":["Safa Ben Atitallah","Chaima Ben Rabah","Maha Driss","Wadii Boulila","Anis Koubaa"],"pdf_url":"https://arxiv.org/pdf/2412.18322v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.18316v1","updated":"2024-12-24T10:04:19Z","published":"2024-12-24T10:04:19Z","title":"Data-Driven Self-Supervised Graph Representation Learning","summary":" Self-supervised graph representation learning (SSGRL) is a representation\nlearning paradigm used to reduce or avoid manual labeling. An essential part of\nSSGRL is graph data augmentation. Existing methods usually rely on heuristics\ncommonly identified through trial and error and are effective only within some\napplication domains. Also, it is not clear why one heuristic is better than\nanother. Moreover, recent studies have argued against some techniques (e.g.,\ndropout: that can change the properties of molecular graphs or destroy relevant\nsignals for graph-based document classification tasks).\n In this study, we propose a novel data-driven SSGRL approach that\nautomatically learns a suitable graph augmentation from the signal encoded in\nthe graph (i.e., the nodes' predictive feature and topological information). We\npropose two complementary approaches that produce learnable feature and\ntopological augmentations. The former learns multi-view augmentation of node\nfeatures, and the latter learns a high-order view of the topology. Moreover,\nthe augmentations are jointly learned with the representation. Our approach is\ngeneral that it can be applied to homogeneous and heterogeneous graphs. We\nperform extensive experiments on node classification (using nine homogeneous\nand heterogeneous datasets) and graph property prediction (using another eight\ndatasets). 
The results show that the proposed method matches or outperforms the\nSOTA SSGRL baselines and performs similarly to semi-supervised methods. The\nanonymised source code is available at https://github.com/AhmedESamy/dsgrl/\n","authors":["Ahmed E. Samy","Zekarias T. Kefatoa","Sarunas Girdzijauskasa"],"pdf_url":"https://arxiv.org/pdf/2412.18316v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.16534v3","updated":"2024-12-24T09:32:02Z","published":"2023-08-31T08:25:47Z","title":"Zero-Shot Conditioning of Score-Based Diffusion Models by Neuro-Symbolic\n Constraints","summary":" Score-based diffusion models have emerged as effective approaches for both\nconditional and unconditional generation. Still conditional generation is based\non either a specific training of a conditional model or classifier guidance,\nwhich requires training a noise-dependent classifier, even when a classifier\nfor uncorrupted data is given. We propose a method that, given a pre-trained\nunconditional score-based generative model, samples from the conditional\ndistribution under arbitrary logical constraints, without requiring additional\ntraining. Differently from other zero-shot techniques, that rather aim at\ngenerating valid conditional samples, our method is designed for approximating\nthe true conditional distribution. Firstly, we show how to manipulate the\nlearned score in order to sample from an un-normalized distribution conditional\non a user-defined constraint. Then, we define a flexible and numerically stable\nneuro-symbolic framework for encoding soft logical constraints. Combining these\ntwo ingredients we obtain a general, but approximate, conditional sampling\nalgorithm. We further developed effective heuristics aimed at improving the\napproximation. 
Finally, we show the effectiveness of our approach in\napproximating conditional distributions for various types of constraints and\ndata: tabular data, images and time series.\n","authors":["Davide Scassola","Sebastiano Saccani","Ginevra Carbone","Luca Bortolussi"],"pdf_url":"https://arxiv.org/pdf/2308.16534v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.09059v3","updated":"2024-12-24T09:30:14Z","published":"2024-12-12T08:40:22Z","title":"Go With the Flow: Fast Diffusion for Gaussian Mixture Models","summary":" Schr\\\"{o}dinger Bridges (SB) are diffusion processes that steer, in finite\ntime, a given initial distribution to another final one while minimizing a\nsuitable cost functional. Although various methods for computing SBs have\nrecently been proposed in the literature, most of these approaches require\ncomputationally expensive training schemes, even for solving low-dimensional\nproblems. In this work, we propose an analytic parametrization of a set of\nfeasible policies for steering the distribution of a dynamical system from one\nGaussian Mixture Model (GMM) to another. Instead of relying on standard\nnon-convex optimization techniques, the optimal policy within the set can be\napproximated as the solution of a low-dimensional linear program whose\ndimension scales linearly with the number of components in each mixture.\nFurthermore, our method generalizes naturally to more general classes of\ndynamical systems such as controllable Linear Time-Varying systems that cannot\ncurrently be solved using traditional neural SB approaches. We showcase the\npotential of this approach in low-to-moderate dimensional problems such as\nimage-to-image translation in the latent space of an autoencoder, and various\nother examples. 
We also benchmark our approach on an Entropic Optimal Transport\n(EOT) problem and show that it outperforms state-of-the-art methods in cases\nwhere the boundary distributions are mixture models while requiring virtually\nno training.\n","authors":["George Rapakoulias","Ali Reza Pedram","Panagiotis Tsiotras"],"pdf_url":"https://arxiv.org/pdf/2412.09059v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.18302v1","updated":"2024-12-24T09:11:37Z","published":"2024-12-24T09:11:37Z","title":"FameBias: Embedding Manipulation Bias Attack in Text-to-Image Models","summary":" Text-to-Image (T2I) diffusion models have rapidly advanced, enabling the\ngeneration of high-quality images that align closely with textual descriptions.\nHowever, this progress has also raised concerns about their misuse for\npropaganda and other malicious activities. Recent studies reveal that attackers\ncan embed biases into these models through simple fine-tuning, causing them to\ngenerate targeted imagery when triggered by specific phrases. This underscores\nthe potential for T2I models to act as tools for disseminating propaganda,\nproducing images aligned with an attacker's objective for end-users.\n Building on this concept, we introduce FameBias, a T2I biasing attack that\nmanipulates the embeddings of input prompts to generate images featuring\nspecific public figures. Unlike prior methods, Famebias operates solely on the\ninput embedding vectors without requiring additional model training. We\nevaluate FameBias comprehensively using Stable Diffusion V2, generating a large\ncorpus of images based on various trigger nouns and target public figures. 
Our\nexperiments demonstrate that FameBias achieves a high attack success rate while\npreserving the semantic context of the original prompts across multiple\ntrigger-target pairs.\n","authors":["Jaechul Roh","Andrew Yuan","Jinsong Mao"],"pdf_url":"https://arxiv.org/pdf/2412.18302v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2303.07189v4","updated":"2024-12-24T09:06:15Z","published":"2023-03-13T15:30:28Z","title":"Optimizing Convolutional Neural Networks for Chronic Obstructive\n Pulmonary Disease Detection in Clinical Computed Tomography Imaging","summary":" We aim to optimize the binary detection of Chronic Obstructive Pulmonary\nDisease (COPD) based on emphysema presence in the lung with convolutional\nneural networks (CNN) by exploring manually adjusted versus automated\nwindow-setting optimization (WSO) on computed tomography (CT) images. 7,194 CT\nimages (3,597 with COPD; 3,597 healthy controls) from 78 subjects were selected\nretrospectively (10.2018-12.2021) and preprocessed. For each image, intensity\nvalues were manually clipped to the emphysema window setting and a baseline\n'full-range' window setting. Class-balanced train, validation, and test sets\ncontained 3,392, 1,114, and 2,688 images. The network backbone was optimized by\ncomparing various CNN architectures. Furthermore, automated WSO was implemented\nby adding a customized layer to the model. The image-level area under the\nReceiver Operating Characteristics curve (AUC) [lower, upper limit 95%\nconfidence] was utilized to compare model variations. Repeated inference (n=7)\non the test set showed that the DenseNet was the most efficient backbone and\nachieved a mean AUC of 0.80 [0.76, 0.85] without WSO. Comparably, with input\nimages manually adjusted to the emphysema window, the DenseNet model predicted\nCOPD with a mean AUC of 0.86 [0.82, 0.89]. 
By adding a customized WSO layer to\nthe DenseNet, an optimal window in the proximity of the emphysema window\nsetting was learned automatically, and a mean AUC of 0.82 [0.78, 0.86] was\nachieved. Detection of COPD with DenseNet models was improved by WSO of CT data\nto the emphysema window setting range.\n","authors":["Tina Dorosti","Manuel Schultheiss","Felix Hofmann","Johannes Thalhammer","Luisa Kirchner","Theresa Urban","Franz Pfeiffer","Florian Schaff","Tobias Lasser","Daniela Pfeiffer"],"pdf_url":"https://arxiv.org/pdf/2303.07189v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.17623v2","updated":"2024-12-24T09:05:53Z","published":"2024-12-23T14:48:32Z","title":"Towards An Unsupervised Learning Scheme for Efficiently Solving\n Parameterized Mixed-Integer Programs","summary":" In this paper, we describe a novel unsupervised learning scheme for\naccelerating the solution of a family of mixed integer programming (MIP)\nproblems. Distinct substantially from existing learning-to-optimize methods,\nour proposal seeks to train an autoencoder (AE) for binary variables in an\nunsupervised learning fashion, using data of optimal solutions to historical\ninstances for a parametric family of MIPs. By a deliberate design of AE\narchitecture and exploitation of its statistical implication, we present a\nsimple and straightforward strategy to construct a class of cutting plane\nconstraints from the decoder parameters of an offline-trained AE. These\nconstraints reliably enclose the optimal binary solutions of new problem\ninstances thanks to the representation strength of the AE. More importantly,\ntheir integration into the primal MIP problem leads to a tightened MIP with the\nreduced feasible region, which can be resolved at decision time using\noff-the-shelf solvers with much higher efficiency. Our method is applied to a\nbenchmark batch process scheduling problem formulated as a mixed integer linear\nprogramming (MILP) problem. 
Comprehensive results demonstrate that our approach\nsignificantly reduces the computational cost of off-the-shelf MILP solvers\nwhile retaining a high solution quality. The codes of this work are\nopen-sourced at https://github.com/qushiyuan/AE4BV.\n","authors":["Shiyuan Qu","Fenglian Dong","Zhiwei Wei","Chao Shang"],"pdf_url":"https://arxiv.org/pdf/2412.17623v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.18298v1","updated":"2024-12-24T09:05:37Z","published":"2024-12-24T09:05:37Z","title":"Quo Vadis, Anomaly Detection? LLMs and VLMs in the Spotlight","summary":" Video anomaly detection (VAD) has witnessed significant advancements through\nthe integration of large language models (LLMs) and vision-language models\n(VLMs), addressing critical challenges such as interpretability, temporal\nreasoning, and generalization in dynamic, open-world scenarios. This paper\npresents an in-depth review of cutting-edge LLM-/VLM-based methods in 2024,\nfocusing on four key aspects: (i) enhancing interpretability through semantic\ninsights and textual explanations, making visual anomalies more understandable;\n(ii) capturing intricate temporal relationships to detect and localize dynamic\nanomalies across video frames; (iii) enabling few-shot and zero-shot detection\nto minimize reliance on large, annotated datasets; and (iv) addressing\nopen-world and class-agnostic anomalies by using semantic understanding and\nmotion features for spatiotemporal coherence. We highlight their potential to\nredefine the landscape of VAD. 
Additionally, we explore the synergy between\nvisual and textual modalities offered by LLMs and VLMs, highlighting their\ncombined strengths and proposing future directions to fully exploit the\npotential in enhancing video anomaly detection.\n","authors":["Xi Ding","Lei Wang"],"pdf_url":"https://arxiv.org/pdf/2412.18298v1.pdf","comment":"Research report"},{"id":"http://arxiv.org/abs/2412.18297v1","updated":"2024-12-24T09:05:06Z","published":"2024-12-24T09:05:06Z","title":"Learning to Play Against Unknown Opponents","summary":" We consider the problem of a learning agent who has to repeatedly play a\ngeneral sum game against a strategic opponent who acts to maximize their own\npayoff by optimally responding against the learner's algorithm. The learning\nagent knows their own payoff function, but is uncertain about the payoff of\ntheir opponent (knowing only that it is drawn from some distribution\n$\\mathcal{D}$). What learning algorithm should the agent run in order to\nmaximize their own total utility?\n We demonstrate how to construct an $\\varepsilon$-optimal learning algorithm\n(obtaining average utility within $\\varepsilon$ of the optimal utility) for\nthis problem in time polynomial in the size of the input and $1/\\varepsilon$\nwhen either the size of the game or the support of $\\mathcal{D}$ is constant.\nWhen the learning algorithm is further constrained to be a no-regret algorithm,\nwe demonstrate how to efficiently construct an optimal learning algorithm\n(asymptotically achieving the optimal utility) in polynomial time, independent\nof any other assumptions. 
Both results make use of recently developed machinery\nthat converts the analysis of learning algorithms to the study of the class of\ncorresponding geometric objects known as menus.\n","authors":["Eshwar Ram Arunachaleswaran","Natalie Collina","Jon Schneider"],"pdf_url":"https://arxiv.org/pdf/2412.18297v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.18296v1","updated":"2024-12-24T09:04:06Z","published":"2024-12-24T09:04:06Z","title":"Navigating Data Corruption in Machine Learning: Balancing Quality,\n Quantity, and Imputation Strategies","summary":" Data corruption, including missing and noisy data, poses significant\nchallenges in real-world machine learning. This study investigates the effects\nof data corruption on model performance and explores strategies to mitigate\nthese effects through two experimental setups: supervised learning with NLP\ntasks (NLP-SL) and deep reinforcement learning for traffic signal optimization\n(Signal-RL). We analyze the relationship between data corruption levels and\nmodel performance, evaluate the effectiveness of data imputation methods, and\nassess the utility of enlarging datasets to address data corruption.\n Our results show that model performance under data corruption follows a\ndiminishing return curve, modeled by the exponential function. Missing data,\nwhile detrimental, is less harmful than noisy data, which causes severe\nperformance degradation and training instability, particularly in sequential\ndecision-making tasks like Signal-RL. Imputation strategies involve a\ntrade-off: they recover missing information but may introduce noise. Their\neffectiveness depends on imputation accuracy and corruption ratio. 
We identify\ndistinct regions in the imputation advantage heatmap, including an \"imputation\nadvantageous corner\" and an \"imputation disadvantageous edge\" and classify\ntasks as \"noise-sensitive\" or \"noise-insensitive\" based on their decision\nboundaries.\n Furthermore, we find that increasing dataset size mitigates but cannot fully\novercome the effects of data corruption. The marginal utility of additional\ndata diminishes as corruption increases. An empirical rule emerges:\napproximately 30% of the data is critical for determining performance, while\nthe remaining 70% has minimal impact.\n These findings provide actionable insights into data preprocessing,\nimputation strategies, and data collection practices, guiding the development\nof robust machine learning systems in noisy environments.\n","authors":["Qi Liu","Wanjing Ma"],"pdf_url":"https://arxiv.org/pdf/2412.18296v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.11465v3","updated":"2024-12-24T09:03:02Z","published":"2024-11-18T10:58:46Z","title":"Re-examining learning linear functions in context","summary":" In-context learning (ICL) has emerged as a powerful paradigm for easily\nadapting Large Language Models (LLMs) to various tasks. However, our\nunderstanding of how ICL works remains limited. We explore a simple model of\nICL in a controlled setup with synthetic training data to investigate ICL of\nunivariate linear functions. We experiment with a range of GPT-2-like\ntransformer models trained from scratch. Our findings challenge the prevailing\nnarrative that transformers adopt algorithmic approaches like linear regression\nto learn a linear function in-context. These models fail to generalize beyond\ntheir training distribution, highlighting fundamental limitations in their\ncapacity to infer abstract task structures. 
Our experiments lead us to propose\na mathematically precise hypothesis of what the model might be learning.\n","authors":["Omar Naim","Guilhem Fouilhé","Nicholas Asher"],"pdf_url":"https://arxiv.org/pdf/2411.11465v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.16534v2","updated":"2024-12-24T09:00:20Z","published":"2024-12-21T08:25:23Z","title":"DOFEN: Deep Oblivious Forest ENsemble","summary":" Deep Neural Networks (DNNs) have revolutionized artificial intelligence,\nachieving impressive results on diverse data types, including images, videos,\nand texts. However, DNNs still lag behind Gradient Boosting Decision Trees\n(GBDT) on tabular data, a format extensively utilized across various domains.\nIn this paper, we propose DOFEN, short for \\textbf{D}eep \\textbf{O}blivious\n\\textbf{F}orest \\textbf{EN}semble, a novel DNN architecture inspired by\noblivious decision trees. DOFEN constructs relaxed oblivious decision trees\n(rODTs) by randomly combining conditions for each column and further enhances\nperformance with a two-level rODT forest ensembling process. By employing this\napproach, DOFEN achieves state-of-the-art results among DNNs and further\nnarrows the gap between DNNs and tree-based models on the well-recognized\nbenchmark: Tabular Benchmark \\citep{grinsztajn2022tree}, which includes 73\ntotal datasets spanning a wide array of domains. 
The code of DOFEN is available\nat: \\url{https://github.com/Sinopac-Digital-Technology-Division/DOFEN}.\n","authors":["Kuan-Yu Chen","Ping-Han Chiang","Hsin-Rung Chou","Chih-Sheng Chen","Tien-Hao Chang"],"pdf_url":"https://arxiv.org/pdf/2412.16534v2.pdf","comment":"NeurIPS 2024 (poster); (v2: modify and rearrange sections, propose\n multihead extension of DOFEN, include new results on tabular benchmark and\n other benchmarks)"},{"id":"http://arxiv.org/abs/2412.18291v1","updated":"2024-12-24T08:53:54Z","published":"2024-12-24T08:53:54Z","title":"DeepCRCEval: Revisiting the Evaluation of Code Review Comment Generation","summary":" Code review is a vital but demanding aspect of software development,\ngenerating significant interest in automating review comments. Traditional\nevaluation methods for these comments, primarily based on text similarity, face\ntwo major challenges: inconsistent reliability of human-authored comments in\nopen-source projects and the weak correlation of text similarity with\nobjectives like enhancing code quality and detecting defects.\n This study empirically analyzes benchmark comments using a novel set of\ncriteria informed by prior research and developer interviews. We then similarly\nrevisit the evaluation of existing methodologies. Our evaluation framework,\nDeepCRCEval, integrates human evaluators and Large Language Models (LLMs) for a\ncomprehensive reassessment of current techniques based on the criteria set.\nBesides, we also introduce an innovative and efficient baseline, LLM-Reviewer,\nleveraging the few-shot learning capabilities of LLMs for a target-oriented\ncomparison.\n Our research highlights the limitations of text similarity metrics, finding\nthat less than 10% of benchmark comments are high quality for automation. In\ncontrast, DeepCRCEval effectively distinguishes between high and low-quality\ncomments, proving to be a more reliable evaluation mechanism. 
Incorporating LLM\nevaluators into DeepCRCEval significantly boosts efficiency, reducing time and\ncost by 88.78% and 90.32%, respectively. Furthermore, LLM-Reviewer demonstrates\nsignificant potential of focusing task real targets in comment generation.\n","authors":["Junyi Lu","Xiaojia Li","Zihan Hua","Lei Yu","Shiqi Cheng","Li Yang","Fengjun Zhang","Chun Zuo"],"pdf_url":"https://arxiv.org/pdf/2412.18291v1.pdf","comment":"Accepted to the 28th International Conference on Fundamental\n Approaches to Software Engineering (FASE 2025), part of the 28th European\n Joint Conferences on Theory and Practice of Software (ETAPS 2025)"},{"id":"http://arxiv.org/abs/2412.18290v1","updated":"2024-12-24T08:53:30Z","published":"2024-12-24T08:53:30Z","title":"Dissipation alters modes of information encoding in small quantum\n reservoirs near criticality","summary":" Quantum reservoir computing (QRC) has emerged as a promising paradigm for\nharnessing near-term quantum devices to tackle temporal machine learning tasks.\nYet identifying the mechanisms that underlie enhanced performance remains\nchallenging, particularly in many-body open systems where nonlinear\ninteractions and dissipation intertwine in complex ways. Here, we investigate a\nminimal model of a driven-dissipative quantum reservoir described by two\ncoupled Kerr-nonlinear oscillators, an experimentally realizable platform that\nfeatures controllable coupling, intrinsic nonlinearity, and tunable photon\nloss. Using Partial Information Decomposition (PID), we examine how different\ndynamical regimes encode input drive signals in terms of redundancy\n(information shared by each oscillator) and synergy (information accessible\nonly through their joint observation). Our key results show that, near a\ncritical point marking a dynamical bifurcation, the system transitions from\npredominantly redundant to synergistic encoding. 
We further demonstrate that\nsynergy amplifies short-term responsiveness, thereby enhancing immediate memory\nretention, whereas strong dissipation leads to more redundant encoding that\nsupports long-term memory retention. These findings elucidate how the interplay\nof instability and dissipation shapes information processing in small quantum\nsystems, providing a fine-grained, information-theoretic perspective for\nanalyzing and designing QRC platforms.\n","authors":["Krai Cheamsawat","Thiparat Chotibut"],"pdf_url":"https://arxiv.org/pdf/2412.18290v1.pdf","comment":"30 pages, 12 figures"},{"id":"http://arxiv.org/abs/2412.18288v1","updated":"2024-12-24T08:52:06Z","published":"2024-12-24T08:52:06Z","title":"Towards understanding how attention mechanism works in deep learning","summary":" Attention mechanism has been extensively integrated within mainstream neural\nnetwork architectures, such as Transformers and graph attention networks. Yet,\nits underlying working principles remain somewhat elusive. What is its essence?\nAre there any connections between it and traditional machine learning\nalgorithms? In this study, we inspect the process of computing similarity using\nclassic metrics and vector space properties in manifold learning, clustering,\nand supervised learning. We identify the key characteristics of similarity\ncomputation and information propagation in these methods and demonstrate that\nthe self-attention mechanism in deep learning adheres to the same principles\nbut operates more flexibly and adaptively. We decompose the self-attention\nmechanism into a learnable pseudo-metric function and an information\npropagation process based on similarity computation. We prove that the\nself-attention mechanism converges to a drift-diffusion process through\ncontinuous modeling provided the pseudo-metric is a transformation of a metric\nand certain reasonable assumptions hold. This equation could be transformed\ninto a heat equation under a new metric. 
In addition, we give a first-order\nanalysis of attention mechanism with a general pseudo-metric function. This\nstudy aids in understanding the effects and principle of attention mechanism\nthrough physical intuition. Finally, we propose a modified attention mechanism\ncalled metric-attention by leveraging the concept of metric learning to\nfacilitate the ability to learn desired metrics more effectively. Experimental\nresults demonstrate that it outperforms self-attention regarding training\nefficiency, accuracy, and robustness.\n","authors":["Tianyu Ruan","Shihua Zhang"],"pdf_url":"https://arxiv.org/pdf/2412.18288v1.pdf","comment":"38 pages, 6 figures"},{"id":"http://arxiv.org/abs/2310.03977v3","updated":"2024-12-24T08:51:43Z","published":"2023-10-06T02:22:49Z","title":"Perfect Alignment May be Poisonous to Graph Contrastive Learning","summary":" Graph Contrastive Learning (GCL) aims to learn node representations by\naligning positive pairs and separating negative ones. However, few of\nresearchers have focused on the inner law behind specific augmentations used in\ngraph-based learning. What kind of augmentation will help downstream\nperformance, how does contrastive learning actually influence downstream tasks,\nand why the magnitude of augmentation matters so much? This paper seeks to\naddress these questions by establishing a connection between augmentation and\ndownstream performance. Our findings reveal that GCL contributes to downstream\ntasks mainly by separating different classes rather than gathering nodes of the\nsame class. So perfect alignment and augmentation overlap which draw all\nintra-class samples the same can not fully explain the success of contrastive\nlearning. 
Therefore, in order to understand how augmentation aids the\ncontrastive learning process, we conduct further investigations into the\ngeneralization, finding that perfect alignment that draw positive pair the same\ncould help contrastive loss but is poisonous to generalization, as a result,\nperfect alignment may not lead to best downstream performance, so specifically\ndesigned augmentation is needed to achieve appropriate alignment performance\nand improve downstream accuracy. We further analyse the result by information\ntheory and graph spectrum theory and propose two simple but effective methods\nto verify the theories. The two methods could be easily applied to various GCL\nalgorithms and extensive experiments are conducted to prove its effectiveness.\nThe code is available at https://github.com/somebodyhh1/GRACEIS\n","authors":["Jingyu Liu","Huayi Tang","Yong Liu"],"pdf_url":"https://arxiv.org/pdf/2310.03977v3.pdf","comment":"ICML 24"},{"id":"http://arxiv.org/abs/2412.18287v1","updated":"2024-12-24T08:48:48Z","published":"2024-12-24T08:48:48Z","title":"Semi-supervised Credit Card Fraud Detection via Attribute-Driven Graph\n Representation","summary":" Credit card fraud incurs a considerable cost for both cardholders and issuing\nbanks. Contemporary methods apply machine learning-based classifiers to detect\nfraudulent behavior from labeled transaction records. But labeled data are\nusually a small proportion of billions of real transactions due to expensive\nlabeling costs, which implies that they do not well exploit many natural\nfeatures from unlabeled data. Therefore, we propose a semi-supervised graph\nneural network for fraud detection. Specifically, we leverage transaction\nrecords to construct a temporal transaction graph, which is composed of\ntemporal transactions (nodes) and interactions (edges) among them. Then we pass\nmessages among the nodes through a Gated Temporal Attention Network (GTAN) to\nlearn the transaction representation. 
We further model the fraud patterns\nthrough risk propagation among transactions. The extensive experiments are\nconducted on a real-world transaction dataset and two publicly available fraud\ndetection datasets. The result shows that our proposed method, namely GTAN,\noutperforms other state-of-the-art baselines on three fraud detection datasets.\nSemi-supervised experiments demonstrate the excellent fraud detection\nperformance of our model with only a tiny proportion of labeled data.\n","authors":["Sheng Xiang","Mingzhi Zhu","Dawei Cheng","Enxia Li","Ruihui Zhao","Yi Ouyang","Ling Chen","Yefeng Zheng"],"pdf_url":"https://arxiv.org/pdf/2412.18287v1.pdf","comment":"9 pages, 5 figures, AAAI 2023, code:\n https://github.com/AI4Risk/antifraud"},{"id":"http://arxiv.org/abs/2412.18283v1","updated":"2024-12-24T08:42:39Z","published":"2024-12-24T08:42:39Z","title":"On the Local Complexity of Linear Regions in Deep ReLU Networks","summary":" We define the local complexity of a neural network with continuous piecewise\nlinear activations as a measure of the density of linear regions over an input\ndata distribution. We show theoretically that ReLU networks that learn\nlow-dimensional feature representations have a lower local complexity. This\nallows us to connect recent empirical observations on feature learning at the\nlevel of the weight matrices with concrete properties of the learned functions.\nIn particular, we show that the local complexity serves as an upper bound on\nthe total variation of the function over the input data distribution and thus\nthat feature learning can be related to adversarial robustness. Lastly, we\nconsider how optimization drives ReLU networks towards solutions with lower\nlocal complexity. 
Overall, this work contributes a theoretical framework\ntowards relating geometric properties of ReLU networks to different aspects of\nlearning such as feature learning and representation cost.\n","authors":["Niket Patel","Guido Montufar"],"pdf_url":"https://arxiv.org/pdf/2412.18283v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.18281v1","updated":"2024-12-24T08:42:01Z","published":"2024-12-24T08:42:01Z","title":"GDM4MMIMO: Generative Diffusion Models for Massive MIMO Communications","summary":" Massive multiple-input multiple-output (MIMO) offers significant advantages\nin spectral and energy efficiencies, positioning it as a cornerstone technology\nof fifth-generation (5G) wireless communication systems and a promising\nsolution for the burgeoning data demands anticipated in sixth-generation (6G)\nnetworks. In recent years, with the continuous advancement of artificial\nintelligence (AI), a multitude of task-oriented generative foundation models\n(GFMs) have emerged, achieving remarkable performance in various fields such as\ncomputer vision (CV), natural language processing (NLP), and autonomous\ndriving. As a pioneering force, these models are driving the paradigm shift in\nAI towards generative AI (GenAI). Among them, the generative diffusion model\n(GDM), as one of state-of-the-art families of generative models, demonstrates\nan exceptional capability to learn implicit prior knowledge and robust\ngeneralization capabilities, thereby enhancing its versatility and\neffectiveness across diverse applications. In this paper, we delve into the\npotential applications of GDM in massive MIMO communications. Specifically, we\nfirst provide an overview of massive MIMO communication, the framework of GFMs,\nand the working mechanism of GDM. 
Following this, we discuss recent research\nadvancements in the field and present a case study of near-field channel\nestimation based on GDM, demonstrating its promising potential for facilitating\nefficient ultra-dimensional channel statement information (CSI) acquisition in\nthe context of massive MIMO communications. Finally, we highlight several\npressing challenges in future mobile communications and identify promising\nresearch directions surrounding GDM.\n","authors":["Zhenzhou Jin","Li You","Huibin Zhou","Yuanshuo Wang","Xiaofeng Liu","Xinrui Gong","Xiqi Gao","Derrick Wing Kwan Ng","Xiang-Gen Xia"],"pdf_url":"https://arxiv.org/pdf/2412.18281v1.pdf","comment":"6 pages, 3 figures"},{"id":"http://arxiv.org/abs/2412.18277v1","updated":"2024-12-24T08:38:35Z","published":"2024-12-24T08:38:35Z","title":"Towards Modality Generalization: A Benchmark and Prospective Analysis","summary":" Multi-modal learning has achieved remarkable success by integrating\ninformation from various modalities, achieving superior performance in tasks\nlike recognition and retrieval compared to uni-modal approaches. However,\nreal-world scenarios often present novel modalities that are unseen during\ntraining due to resource and privacy constraints, a challenge current methods\nstruggle to address. This paper introduces Modality Generalization (MG), which\nfocuses on enabling models to generalize to unseen modalities. We define two\ncases: weak MG, where both seen and unseen modalities can be mapped into a\njoint embedding space via existing perceptors, and strong MG, where no such\nmappings exist. To facilitate progress, we propose a comprehensive benchmark\nfeaturing multi-modal algorithms and adapt existing methods that focus on\ngeneralization. Extensive experiments highlight the complexity of MG, exposing\nthe limitations of existing methods and identifying key directions for future\nresearch. 
Our work provides a foundation for advancing robust and adaptable\nmulti-modal models, enabling them to handle unseen modalities in realistic\nscenarios.\n","authors":["Xiaohao Liu","Xiaobo Xia","Zhuo Huang","Tat-Seng Chua"],"pdf_url":"https://arxiv.org/pdf/2412.18277v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.18267v1","updated":"2024-12-24T08:27:33Z","published":"2024-12-24T08:27:33Z","title":"NoiseHGNN: Synthesized Similarity Graph-Based Neural Network For Noised\n Heterogeneous Graph Representation Learning","summary":" Real-world graph data environments intrinsically exist noise (e.g., link and\nstructure errors) that inevitably disturb the effectiveness of graph\nrepresentation and downstream learning tasks. For homogeneous graphs, the\nlatest works use original node features to synthesize a similarity graph that\ncan correct the structure of the noised graph. This idea is based on the\nhomogeneity assumption, which states that similar nodes in the homogeneous\ngraph tend to have direct links in the original graph. However, similar nodes\nin heterogeneous graphs usually do not have direct links, which can not be used\nto correct the original noise graph. This causes a significant challenge in\nnoised heterogeneous graph learning. To this end, this paper proposes a novel\nsynthesized similarity-based graph neural network compatible with noised\nheterogeneous graph learning. First, we calculate the original feature\nsimilarities of all nodes to synthesize a similarity-based high-order graph.\nSecond, we propose a similarity-aware encoder to embed original and synthesized\ngraphs with shared parameters. Then, instead of graph-to-graph supervising, we\nsynchronously supervise the original and synthesized graph embeddings to\npredict the same labels. Meanwhile, a target-based graph extracted from the\nsynthesized graph contrasts the structure of the metapath-based graph extracted\nfrom the original graph to learn the mutual information. 
Extensive experiments\nin numerous real-world datasets show the proposed method achieves\nstate-of-the-art records in the noised heterogeneous graph learning tasks. In\nhighlights, +5$\\sim$6\\% improvements are observed in several noised datasets\ncompared with previous SOTA methods. The code and datasets are available at\nhttps://github.com/kg-cc/NoiseHGNN.\n","authors":["Xiong Zhang","Cheng Xie","Haoran Duan","Beibei Yu"],"pdf_url":"https://arxiv.org/pdf/2412.18267v1.pdf","comment":"AAAI2025"},{"id":"http://arxiv.org/abs/2412.18263v1","updated":"2024-12-24T08:25:38Z","published":"2024-12-24T08:25:38Z","title":"Free the Design Space of Equivariant Graph Neural Networks: High-Rank\n Irreducible Cartesian Tensor Decomposition and Bases of Equivariant Spaces","summary":" Irreducible Cartesian tensors (ICTs) play a crucial role in the design of\nequivariant graph neural networks, as well as in theoretical chemistry and\nchemical physics. Meanwhile, the design space of available linear operations on\ntensors that preserve symmetry presents a significant challenge. The ICT\ndecomposition and a basis of this equivariant space are difficult to obtain for\nhigh-order tensors. After decades of research, we recently achieve an explicit\nICT decomposition for $n=5$ \\citep{bonvicini2024irreducible} with factorial\ntime/space complexity. This work, for the first time, obtains decomposition\nmatrices for ICTs up to rank $n=9$ with reduced and affordable complexity, by\nconstructing what we call path matrices. The path matrices are obtained via\nperforming chain-like contraction with Clebsch-Gordan matrices following the\nparentage scheme. We prove and leverage that the concatenation of path matrices\nis an orthonormal change-of-basis matrix between the Cartesian tensor product\nspace and the spherical direct sum spaces. 
Furthermore, we identify a complete\northogonal basis for the equivariant space, rather than a spanning set\n\\citep{pearce2023brauer}, through this path matrices technique. We further\nextend our result to the arbitrary tensor product and direct sum spaces,\nenabling free design between different spaces while keeping symmetry. The\nPython code is available in the appendix where the $n=6,\\dots,9$ ICT\ndecomposition matrices are obtained in <0.1s, 0.5s, 1s, 3s, 11s, and 4m32s,\nrespectively.\n","authors":["Shihao Shao","Yikang Li","Zhouchen Lin","Qinghua Cui"],"pdf_url":"https://arxiv.org/pdf/2412.18263v1.pdf","comment":"46 pages, 4 code snippets"},{"id":"http://arxiv.org/abs/2412.18262v1","updated":"2024-12-24T08:24:10Z","published":"2024-12-24T08:24:10Z","title":"Efficient Contrastive Explanations on Demand","summary":" Recent work revealed a tight connection between adversarial robustness and\nrestricted forms of symbolic explanations, namely distance-based (formal)\nexplanations. This connection is significant because it represents a first step\ntowards making the computation of symbolic explanations as efficient as\ndeciding the existence of adversarial examples, especially for highly complex\nmachine learning (ML) models. However, a major performance bottleneck remains,\nbecause of the very large number of features that ML models may possess, in\nparticular for deep neural networks. This paper proposes novel algorithms to\ncompute the so-called contrastive explanations for ML models with a large\nnumber of features, by leveraging on adversarial robustness. Furthermore, the\npaper also proposes novel algorithms for listing explanations and finding\nsmallest contrastive explanations. 
The experimental results demonstrate the\nperformance gains achieved by the novel algorithms proposed in this paper.\n","authors":["Yacine Izza","Joao Marques-Silva"],"pdf_url":"https://arxiv.org/pdf/2412.18262v1.pdf","comment":"arXiv admin note: substantial text overlap with arXiv:2405.08297"},{"id":"http://arxiv.org/abs/2311.13015v3","updated":"2024-12-24T08:19:29Z","published":"2023-11-21T21:44:28Z","title":"Fast and Interpretable Mortality Risk Scores for Critical Care Patients","summary":" Prediction of mortality in intensive care unit (ICU) patients typically\nrelies on black box models (that are unacceptable for use in hospitals) or\nhand-tuned interpretable models (that might lead to the loss in performance).\nWe aim to bridge the gap between these two categories by building on modern\ninterpretable ML techniques to design interpretable mortality risk scores that\nare as accurate as black boxes. We developed a new algorithm, GroupFasterRisk,\nwhich has several important benefits: it uses both hard and soft direct\nsparsity regularization, it incorporates group sparsity to allow more cohesive\nmodels, it allows for monotonicity constraint to include domain knowledge, and\nit produces many equally-good models, which allows domain experts to choose\namong them. For evaluation, we leveraged the largest existing public ICU\nmonitoring datasets (MIMIC III and eICU). Models produced by GroupFasterRisk\noutperformed OASIS and SAPS II scores and performed similarly to APACHE IV/IVa\nwhile using at most a third of the parameters. For patients with\nsepsis/septicemia, acute myocardial infarction, heart failure, and acute kidney\nfailure, GroupFasterRisk models outperformed OASIS and SOFA. Finally, different\nmortality prediction ML approaches performed better based on variables selected\nby GroupFasterRisk as compared to OASIS variables. 
GroupFasterRisk's models\nperformed better than risk scores currently used in hospitals, and on par with\nblack box ML models, while being orders of magnitude sparser. Because\nGroupFasterRisk produces a variety of risk scores, it allows design flexibility\n- the key enabler of practical model creation. GroupFasterRisk is a fast,\naccessible, and flexible procedure that allows learning a diverse set of sparse\nrisk scores for mortality prediction.\n","authors":["Chloe Qinyu Zhu","Muhang Tian","Lesia Semenova","Jiachang Liu","Jack Xu","Joseph Scarpa","Cynthia Rudin"],"pdf_url":"https://arxiv.org/pdf/2311.13015v3.pdf","comment":"This article has been accepted for publication in the Journal of the\n American Medical Informatics Association, published by Oxford University\n Press"},{"id":"http://arxiv.org/abs/2410.15342v2","updated":"2024-12-24T08:13:43Z","published":"2024-10-20T09:32:03Z","title":"ConSinger: Efficient High-Fidelity Singing Voice Generation with Minimal\n Steps","summary":" Singing voice synthesis (SVS) system is expected to generate high-fidelity\nsinging voice from given music scores (lyrics, duration and pitch). Recently,\ndiffusion models have performed well in this field. However, sacrificing\ninference speed to exchange with high-quality sample generation limits its\napplication scenarios. In order to obtain high quality synthetic singing voice\nmore efficiently, we propose a singing voice synthesis method based on the\nconsistency model, ConSinger, to achieve high-fidelity singing voice synthesis\nwith minimal steps. The model is trained by applying consistency constraint and\nthe generation quality is greatly improved at the expense of a small amount of\ninference speed. Our experiments show that ConSinger is highly competitive with\nthe baseline model in terms of generation speed and quality. 
Audio samples are\navailable at https://keylxiao.github.io/consinger.\n","authors":["Yulin Song","Guorui Sang","Jing Yu","Chuangbai Xiao"],"pdf_url":"https://arxiv.org/pdf/2410.15342v2.pdf","comment":"Singing voice synthesis, Consistency models, Shallow Diffusion\n Mechanism; Accepted by ICASSP 2025"},{"id":"http://arxiv.org/abs/2412.18256v1","updated":"2024-12-24T08:13:01Z","published":"2024-12-24T08:13:01Z","title":"Robust Semi-Supervised Learning in Open Environments","summary":" Semi-supervised learning (SSL) aims to improve performance by exploiting\nunlabeled data when labels are scarce. Conventional SSL studies typically\nassume close environments where important factors (e.g., label, feature,\ndistribution) between labeled and unlabeled data are consistent. However, more\npractical tasks involve open environments where important factors between\nlabeled and unlabeled data are inconsistent. It has been reported that\nexploiting inconsistent unlabeled data causes severe performance degradation,\neven worse than the simple supervised learning baseline. Manually verifying the\nquality of unlabeled data is not desirable, therefore, it is important to study\nrobust SSL with inconsistent unlabeled data in open environments. This paper\nbriefly introduces some advances in this line of research, focusing on\ntechniques concerning label, feature, and data distribution inconsistency in\nSSL, and presents the evaluation benchmarks. Open research problems are also\ndiscussed for reference purposes.\n","authors":["Lan-Zhe Guo","Lin-Han Jia","Jie-Jing Shao","Yu-Feng Li"],"pdf_url":"https://arxiv.org/pdf/2412.18256v1.pdf","comment":"12 pages, 4 figures"},{"id":"http://arxiv.org/abs/2405.08297v2","updated":"2024-12-24T08:12:26Z","published":"2024-05-14T03:42:33Z","title":"Distance-Restricted Explanations: Theoretical Underpinnings & Efficient\n Implementation","summary":" The uses of machine learning (ML) have snowballed in recent years. 
In many\ncases, ML models are highly complex, and their operation is beyond the\nunderstanding of human decision-makers. Nevertheless, some uses of ML models\ninvolve high-stakes and safety-critical applications. Explainable artificial\nintelligence (XAI) aims to help human decision-makers in understanding the\noperation of such complex ML models, thus eliciting trust in their operation.\nUnfortunately, the majority of past XAI work is based on informal approaches,\nthat offer no guarantees of rigor. Unsurprisingly, there exists comprehensive\nexperimental and theoretical evidence confirming that informal methods of XAI\ncan provide human-decision makers with erroneous information. Logic-based XAI\nrepresents a rigorous approach to explainability; it is model-based and offers\nthe strongest guarantees of rigor of computed explanations. However, a\nwell-known drawback of logic-based XAI is the complexity of logic reasoning,\nespecially for highly complex ML models. Recent work proposed\ndistance-restricted explanations, i.e. explanations that are rigorous provided\nthe distance to a given input is small enough. Distance-restricted\nexplainability is tightly related with adversarial robustness, and it has been\nshown to scale for moderately complex ML models, but the number of inputs still\nrepresents a key limiting factor. 
This paper investigates novel algorithms for\nscaling up the performance of logic-based explainers when computing and\nenumerating ML model explanations with a large number of inputs.\n","authors":["Yacine Izza","Xuanxiang Huang","Antonio Morgado","Jordi Planes","Alexey Ignatiev","Joao Marques-Silva"],"pdf_url":"https://arxiv.org/pdf/2405.08297v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.18248v1","updated":"2024-12-24T08:02:43Z","published":"2024-12-24T08:02:43Z","title":"Detection and Forecasting of Parkinson Disease Progression from Speech\n Signal Features Using MultiLayer Perceptron and LSTM","summary":" Accurate diagnosis of Parkinson disease, especially in its early stages, can\nbe a challenging task. The application of machine learning techniques helps\nimprove the diagnostic accuracy of Parkinson disease detection but only few\nstudies have presented work towards the prediction of disease progression. In\nthis research work, Long Short Term Memory LSTM was trained using the\ndiagnostic features on Parkinson patients speech signals, to predict the\ndisease progression while a Multilayer Perceptron MLP was trained on the same\ndiagnostic features to detect the disease. 
Diagnostic features selected using\ntwo well-known feature selection methods named Relief-F and Sequential Forward\nSelection and applied on LSTM and MLP have shown to accurately predict the\ndisease progression as stage 2 and 3 and its existence respectively.\n","authors":["Majid Ali","Hina Shakir","Asia Samreen","Sohaib Ahmed"],"pdf_url":"https://arxiv.org/pdf/2412.18248v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.18247v1","updated":"2024-12-24T08:02:28Z","published":"2024-12-24T08:02:28Z","title":"Fréchet regression for multi-label feature selection with implicit\n regularization","summary":" Fr\\'echet regression extends linear regression to model complex responses\n in metric spaces, making it particularly relevant for multi-label regression,\n where each instance can have multiple associated labels. However, variable\n selection within this framework remains underexplored. In this paper, we pro\npose a novel variable selection method that employs implicit regularization\n instead of traditional explicit regularization approaches, which can\nintroduce\n bias. Our method effectively captures nonlinear interactions between predic\ntors and responses while promoting model sparsity. We provide theoretical\n results demonstrating selection consistency and illustrate the performance of\n our approach through numerical examples\n","authors":["Dou El Kefel Mansouri","Seif-Eddine Benkabou","Khalid Benabdeslem"],"pdf_url":"https://arxiv.org/pdf/2412.18247v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2410.09945v2","updated":"2024-12-24T08:01:20Z","published":"2024-10-13T18:03:53Z","title":"Variational Diffusion Posterior Sampling with Midpoint Guidance","summary":" Diffusion models have recently shown considerable potential in solving\nBayesian inverse problems when used as priors. However, sampling from the\nresulting denoising posterior distributions remains a challenge as it involves\nintractable terms. 
To tackle this issue, state-of-the-art approaches formulate\nthe problem as that of sampling from a surrogate diffusion model targeting the\nposterior and decompose its scores into two terms: the prior score and an\nintractable guidance term. While the former is replaced by the pre-trained\nscore of the considered diffusion model, the guidance term has to be estimated.\nIn this paper, we propose a novel approach that utilises a decomposition of the\ntransitions which, in contrast to previous methods, allows a trade-off between\nthe complexity of the intractable guidance term and that of the prior\ntransitions. We validate the proposed approach through extensive experiments on\nlinear and nonlinear inverse problems, including challenging cases with latent\ndiffusion models as priors. We then demonstrate its applicability to various\nmodalities and its promising impact on public health by tackling cardiovascular\ndisease diagnosis through the reconstruction of incomplete electrocardiograms.\nThe code is publicly available at \\url{https://github.com/yazidjanati/mgps}.\n","authors":["Badr Moufad","Yazid Janati","Lisa Bedin","Alain Durmus","Randal Douc","Eric Moulines","Jimmy Olsson"],"pdf_url":"https://arxiv.org/pdf/2410.09945v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.02688v2","updated":"2024-12-24T07:47:02Z","published":"2024-11-05T00:16:01Z","title":"On the loss of context-awareness in general instruction fine-tuning","summary":" Pre-trained Large Language Models (LLMs) require post-training methods such\nas supervised fine-tuning (SFT) on instruction-response pairs to enable\ninstruction following. However, this process can potentially harm existing\ncapabilities learned during pre-training. In this paper, we investigate the\nloss of context awareness after SFT, where context awareness is defined as the\nability to extract and understand information from user-provided context and\nrespond accordingly. 
We are the first to identify and show that the loss of\ncontext awareness, as reflected by the performance drop in the\nNeedle-in-a-Haystack test, occurs in instruction fine-tuned LLMs when the chat\ntemplate is applied to input prompts. We identify that the performance decline\nis partially caused by an attention bias toward different roles learned during\nconversational instruction fine-tuning. We validate our hypothesis by\nvisualizing changes in attention allocation after the chat template is applied\nand manually steering the attention heads. Based on these observations, we\npropose a metric to select context-dependent examples from general instruction\nfine-tuning datasets. We then apply conditional instruction fine-tuning with a\ncontext-dependency indicator, enabling the model to learn context awareness\nfrom these selected examples. Empirical experiments on four context-dependent\ndownstream tasks and three pre-trained LLMs of different sizes show that our\nmethod effectively mitigates the loss of context awareness without compromising\ngeneral instruction-following capabilities. 
Given our findings, we strongly\nadvocate for careful benchmarking of context awareness after instruction\nfine-tuning.\n","authors":["Yihan Wang","Andrew Bai","Nanyun Peng","Cho-Jui Hsieh"],"pdf_url":"https://arxiv.org/pdf/2411.02688v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.18239v1","updated":"2024-12-24T07:46:50Z","published":"2024-12-24T07:46:50Z","title":"OMG-HD: A High-Resolution AI Weather Model for End-to-End Forecasts from\n Observations","summary":" In recent years, Artificial Intelligence Weather Prediction (AIWP) models\nhave achieved performance comparable to, or even surpassing, traditional\nNumerical Weather Prediction (NWP) models by leveraging reanalysis data.\nHowever, a less-explored approach involves training AIWP models directly on\nobservational data, enhancing computational efficiency and improving forecast\naccuracy by reducing the uncertainties introduced through data assimilation\nprocesses. In this study, we propose OMG-HD, a novel AI-based regional\nhigh-resolution weather forecasting model designed to make predictions directly\nfrom observational data sources, including surface stations, radar, and\nsatellite, thereby removing the need for operational data assimilation. Our\nevaluation shows that OMG-HD outperforms both the European Centre for\nMedium-Range Weather Forecasts (ECMWF)'s high-resolution operational\nforecasting system, IFS-HRES, and the High-Resolution Rapid Refresh (HRRR)\nmodel at lead times of up to 12 hours across the contiguous United States\n(CONUS) region. We achieve up to a 13% improvement on RMSE for 2-meter\ntemperature, 17% on 10-meter wind speed, 48% on 2-meter specific humidity, and\n32% on surface pressure compared to HRRR. Our method shows that it is possible\nto use AI-driven approaches for rapid weather predictions without relying on\nNWP-derived weather fields as model input. 
This is a promising step towards\nusing observational data directly to make operational forecasts with AIWP\nmodels.\n","authors":["Pengcheng Zhao","Jiang Bian","Zekun Ni","Weixin Jin","Jonathan Weyn","Zuliang Fang","Siqi Xiang","Haiyu Dong","Bin Zhang","Hongyu Sun","Kit Thambiratnam","Qi Zhang"],"pdf_url":"https://arxiv.org/pdf/2412.18239v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.16899v2","updated":"2024-12-24T07:43:17Z","published":"2024-12-22T07:20:17Z","title":"Integrating Random Effects in Variational Autoencoders for\n Dimensionality Reduction of Correlated Data","summary":" Variational Autoencoders (VAE) are widely used for dimensionality reduction\nof large-scale tabular and image datasets, under the assumption of independence\nbetween data observations. In practice, however, datasets are often correlated,\nwith typical sources of correlation including spatial, temporal and clustering\nstructures. Inspired by the literature on linear mixed models (LMM), we propose\nLMMVAE -- a novel model which separates the classic VAE latent model into fixed\nand random parts. While the fixed part assumes the latent variables are\nindependent as usual, the random part consists of latent variables which are\ncorrelated between similar clusters in the data such as nearby locations or\nsuccessive measurements. The classic VAE architecture and loss are modified\naccordingly. LMMVAE is shown to improve squared reconstruction error and\nnegative likelihood loss significantly on unseen data, with simulated as well\nas real datasets from various applications and correlation scenarios. 
It also\nshows improvement in the performance of downstream tasks such as supervised\nclassification on the learned representations.\n","authors":["Giora Simchoni","Saharon Rosset"],"pdf_url":"https://arxiv.org/pdf/2412.16899v2.pdf","comment":"30 pages, 5 figures"},{"id":"http://arxiv.org/abs/2412.18237v1","updated":"2024-12-24T07:43:14Z","published":"2024-12-24T07:43:14Z","title":"Schödinger Bridge Type Diffusion Models as an Extension of Variational\n Autoencoders","summary":" Generative diffusion models use time-forward and backward stochastic\ndifferential equations to connect the data and prior distributions. While\nconventional diffusion models (e.g., score-based models) only learn the\nbackward process, more flexible frameworks have been proposed to also learn the\nforward process by employing the Schr\\\"odinger bridge (SB). However, due to the\ncomplexity of the mathematical structure behind SB-type models, we can not\neasily give an intuitive understanding of their objective function. In this\nwork, we propose a unified framework to construct diffusion models by\nreinterpreting the SB-type models as an extension of variational autoencoders.\nIn this context, the data processing inequality plays a crucial role. As a\nresult, we find that the objective function consists of the prior loss and\ndrift matching parts.\n","authors":["Kentaro Kaba","Reo Shimizu","Masayuki Ohzeki","Yuki Sughiyama"],"pdf_url":"https://arxiv.org/pdf/2412.18237v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2110.09823v5","updated":"2024-12-24T07:36:33Z","published":"2021-10-19T10:15:00Z","title":"An Empirical Study: Extensive Deep Temporal Point Process","summary":" Temporal point process as the stochastic process on continuous domain of time\nis commonly used to model the asynchronous event sequence featuring with\noccurrence timestamps. 
Thanks to the strong expressivity of deep neural\nnetworks, they are emerging as a promising choice for capturing the patterns in\nasynchronous sequences, in the context of temporal point process. In this\npaper, we first review recent research emphasis and difficulties in modeling\nasynchronous event sequences with deep temporal point process, which can be\nconcluded into four fields: encoding of history sequence, formulation of\nconditional intensity function, relational discovery of events and learning\napproaches for optimization. We introduce most of recently proposed models by\ndismantling them into the four parts, and conduct experiments by remodularizing\nthe first three parts with the same learning strategy for a fair empirical\nevaluation. Besides, we extend the history encoders and conditional intensity\nfunction family, and propose a Granger causality discovery framework for\nexploiting the relations among multi-types of events. Because the Granger\ncausality can be represented by the Granger causality graph, discrete graph\nstructure learning in the framework of Variational Inference is employed to\nreveal latent structures of the graph. Further experiments show that the\nproposed framework with latent graph discovery can both capture the relations\nand achieve an improved fitting and predicting performance.\n","authors":["Haitao Lin","Cheng Tan","Lirong Wu","Zhangyang Gao","Zicheng Liu","Stan. Z. Li"],"pdf_url":"https://arxiv.org/pdf/2110.09823v5.pdf","comment":"22 pages, 8 figures"},{"id":"http://arxiv.org/abs/2412.18234v1","updated":"2024-12-24T07:35:48Z","published":"2024-12-24T07:35:48Z","title":"Conditional Deep Canonical Time Warping","summary":" Temporal alignment of sequences is a fundamental challenge in many\napplications, such as computer vision and bioinformatics, where local time\nshifting needs to be accounted for. Misalignment can lead to poor model\ngeneralization, especially in high-dimensional sequences. 
Existing methods\noften struggle with optimization when dealing with high-dimensional sparse\ndata, falling into poor alignments. Feature selection is frequently used to\nenhance model performance for sparse data. However, a fixed set of selected\nfeatures would not generally work for dynamically changing sequences and would\nneed to be modified based on the state of the sequence. Therefore, modifying\nthe selected feature based on contextual input would result in better\nalignment. Our suggested method, Conditional Deep Canonical Temporal Time\nWarping (CDCTW), is designed for temporal alignment in sparse temporal data to\naddress these challenges. CDCTW enhances alignment accuracy for high\ndimensional time-dependent views by performing dynamic time warping on data\nembedded in maximally correlated subspace which handles sparsity with novel\nfeature selection method. We validate the effectiveness of CDCTW through\nextensive experiments on various datasets, demonstrating superior performance\nover previous techniques.\n","authors":["Afek Steinberg","Ran Eisenberg","Ofir Lindenbaum"],"pdf_url":"https://arxiv.org/pdf/2412.18234v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.18231v1","updated":"2024-12-24T07:30:20Z","published":"2024-12-24T07:30:20Z","title":"Towards Macro-AUC oriented Imbalanced Multi-Label Continual Learning","summary":" In Continual Learning (CL), while existing work primarily focuses on the\nmulti-class classification task, there has been limited research on Multi-Label\nLearning (MLL). In practice, MLL datasets are often class-imbalanced, making it\ninherently challenging, a problem that is even more acute in CL. Due to its\nsensitivity to imbalance, Macro-AUC is an appropriate and widely used measure\nin MLL. However, there is no research to optimize Macro-AUC in MLCL\nspecifically. 
To fill this gap, in this paper, we propose a new memory\nreplay-based method to tackle the imbalance issue for Macro-AUC-oriented MLCL.\nSpecifically, inspired by recent theory work, we propose a new Reweighted\nLabel-Distribution-Aware Margin (RLDAM) loss. Furthermore, to be compatible\nwith the RLDAM loss, a new memory-updating strategy named Weight Retain\nUpdating (WRU) is proposed to maintain the numbers of positive and negative\ninstances of the original dataset in memory. Theoretically, we provide superior\ngeneralization analyses of the RLDAM-based algorithm in terms of Macro-AUC,\nseparately in batch MLL and MLCL settings. This is the first work to offer\ntheoretical generalization analyses in MLCL to our knowledge. Finally, a series\nof experimental results illustrate the effectiveness of our method over several\nbaselines. Our codes are available at\nhttps://github.com/ML-Group-SDU/Macro-AUC-CL.\n","authors":["Yan Zhang","Guoqiang Wu","Bingzheng Wang","Teng Pang","Haoliang Sun","Yilong Yin"],"pdf_url":"https://arxiv.org/pdf/2412.18231v1.pdf","comment":"7 pages of main text, 11 pages of appendix, accepted to AAAI 2025"},{"id":"http://arxiv.org/abs/2408.08685v3","updated":"2024-12-24T07:08:45Z","published":"2024-08-16T11:58:34Z","title":"Can Large Language Models Improve the Adversarial Robustness of Graph\n Neural Networks?","summary":" Graph neural networks (GNNs) are vulnerable to adversarial attacks,\nespecially for topology perturbations, and many methods that improve the\nrobustness of GNNs have received considerable attention. Recently, we have\nwitnessed the significant success of large language models (LLMs), leading many\nto explore the great potential of LLMs on GNNs. However, they mainly focus on\nimproving the performance of GNNs by utilizing LLMs to enhance the node\nfeatures. Therefore, we ask: Will the robustness of GNNs also be enhanced with\nthe powerful understanding and inference capabilities of LLMs? 
By presenting\nthe empirical results, we find that despite that LLMs can improve the\nrobustness of GNNs, there is still an average decrease of 23.1% in accuracy,\nimplying that the GNNs remain extremely vulnerable against topology attacks.\nTherefore, another question is how to extend the capabilities of LLMs on graph\nadversarial robustness. In this paper, we propose an LLM-based robust graph\nstructure inference framework, LLM4RGNN, which distills the inference\ncapabilities of GPT-4 into a local LLM for identifying malicious edges and an\nLM-based edge predictor for finding missing important edges, so as to recover a\nrobust graph structure. Extensive experiments demonstrate that LLM4RGNN\nconsistently improves the robustness across various GNNs. Even in some cases\nwhere the perturbation ratio increases to 40%, the accuracy of GNNs is still\nbetter than that on the clean graph. The source code can be found in\nhttps://github.com/zhongjian-zhang/LLM4RGNN.\n","authors":["Zhongjian Zhang","Xiao Wang","Huichi Zhou","Yue Yu","Mengmei Zhang","Cheng Yang","Chuan Shi"],"pdf_url":"https://arxiv.org/pdf/2408.08685v3.pdf","comment":"accepted by KDD 2025"},{"id":"http://arxiv.org/abs/2412.18222v1","updated":"2024-12-24T07:07:14Z","published":"2024-12-24T07:07:14Z","title":"Leveraging Convolutional Neural Network-Transformer Synergy for\n Predictive Modeling in Risk-Based Applications","summary":" With the development of the financial industry, credit default prediction, as\nan important task in financial risk management, has received increasing\nattention. Traditional credit default prediction methods mostly rely on machine\nlearning models, such as decision trees and random forests, but these methods\nhave certain limitations in processing complex data and capturing potential\nrisk patterns. To this end, this paper proposes a deep learning model based on\nthe combination of convolutional neural networks (CNN) and Transformer for\ncredit user default prediction. 
The model combines the advantages of CNN in\nlocal feature extraction with the ability of Transformer in global dependency\nmodeling, effectively improving the accuracy and robustness of credit default\nprediction. Through experiments on public credit default datasets, the results\nshow that the CNN+Transformer model outperforms traditional machine learning\nmodels, such as random forests and XGBoost, in multiple evaluation indicators\nsuch as accuracy, AUC, and KS value, demonstrating its powerful ability in\ncomplex financial data modeling. Further experimental analysis shows that\nappropriate optimizer selection and learning rate adjustment play a vital role\nin improving model performance. In addition, the ablation experiment of the\nmodel verifies the advantages of the combination of CNN and Transformer and\nproves the complementarity of the two in credit default prediction. This study\nprovides a new idea for credit default prediction and provides strong support\nfor risk assessment and intelligent decision-making in the financial field.\nFuture research can further improve the prediction effect and generalization\nability by introducing more unstructured data and improving the model\narchitecture.\n","authors":["Yuhan Wang","Zhen Xu","Yue Yao","Jinsong Liu","Jiating Lin"],"pdf_url":"https://arxiv.org/pdf/2412.18222v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.18221v1","updated":"2024-12-24T07:05:55Z","published":"2024-12-24T07:05:55Z","title":"GIMS: Image Matching System Based on Adaptive Graph Construction and\n Graph Neural Network","summary":" Feature-based image matching has extensive applications in computer vision.\nKeypoints detected in images can be naturally represented as graph structures,\nand Graph Neural Networks (GNNs) have been shown to outperform traditional deep\nlearning techniques. Consequently, the paradigm of image matching via GNNs has\ngained significant prominence in recent academic research. 
In this paper, we\nfirst introduce an innovative adaptive graph construction method that utilizes\na filtering mechanism based on distance and dynamic threshold similarity. This\nmethod dynamically adjusts the criteria for incorporating new vertices based on\nthe characteristics of existing vertices, allowing for the construction of more\nprecise and robust graph structures while avoiding redundancy. We further\ncombine the vertex processing capabilities of GNNs with the global awareness\ncapabilities of Transformers to enhance the model's representation of spatial\nand feature information within graph structures. This hybrid model provides a\ndeeper understanding of the interrelationships between vertices and their\ncontributions to the matching process. Additionally, we employ the Sinkhorn\nalgorithm to iteratively solve for optimal matching results. Finally, we\nvalidate our system using extensive image datasets and conduct comprehensive\ncomparative experiments. Experimental results demonstrate that our system\nachieves an average improvement of 3.8x-40.3x in overall matching performance.\nAdditionally, the number of vertices and edges significantly impacts training\nefficiency and memory usage; therefore, we employ multi-GPU technology to\naccelerate the training process. Our code is available at\nhttps://github.com/songxf1024/GIMS.\n","authors":["Xianfeng Song","Yi Zou","Zheng Shi","Zheng Liu"],"pdf_url":"https://arxiv.org/pdf/2412.18221v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.17155v2","updated":"2024-12-24T07:01:36Z","published":"2024-12-22T20:33:59Z","title":"The Potential of Convolutional Neural Networks for Cancer Detection","summary":" Early detection of cancer is critical in improving treatment outcomes and\nincreasing survival rates, particularly for common cancers such as lung,\nbreast, and prostate which collectively contribute to a significant global\nmortality burden. 
With advancements in imaging technologies and data\nprocessing, Convolutional Neural Networks (CNNs) have emerged as a powerful\ntool for analyzing and classifying medical images, enabling more precise cancer\ndetection. This paper provides a comprehensive review of recent studies\nleveraging CNN models for detecting ten different types of cancer. Each study\nemploys distinct CNN architectures to identify patterns associated with these\ncancers, utilizing diverse datasets. Key differences and strengths of these\narchitectures are meticulously compared and analyzed, highlighting their\nefficacy in improving early detection. Beyond reviewing the performance and\nlimitations of CNN-based cancer detection methods, this study explores the\nfeasibility of integrating CNNs into clinical settings as an early detection\ntool, potentially complementing or replacing traditional methods. Despite\nsignificant progress, challenges remain, including data diversity, result\ninterpretation, and ethical considerations. By identifying the best-performing\nCNN architectures and providing a comparative analysis, this study aims to\ncontribute a comprehensive perspective on the application of CNNs in cancer\ndetection and their role in advancing diagnostic capabilities in healthcare.\n","authors":["Hossein Molaeian","Kaveh Karamjani","Sina Teimouri","Saeed Roshani","Sobhan Roshani"],"pdf_url":"https://arxiv.org/pdf/2412.17155v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.18218v1","updated":"2024-12-24T06:55:53Z","published":"2024-12-24T06:55:53Z","title":"On the Effectiveness of Adversarial Training on Malware Classifiers","summary":" Adversarial Training (AT) has been widely applied to harden learning-based\nclassifiers against adversarial evasive attacks. However, its effectiveness in\nidentifying and strengthening vulnerable areas of the model's decision space\nwhile maintaining high performance on clean data of malware classifiers remains\nan under-explored area. 
In this context, the robustness that AT achieves has\noften been assessed against unrealistic or weak adversarial attacks, which\nnegatively affect performance on clean data and are arguably no longer threats.\nPrevious work seems to suggest robustness is a task-dependent property of AT.\nWe instead argue it is a more complex problem that requires exploring AT and\nthe intertwined roles played by certain factors within data, feature\nrepresentations, classifiers, and robust optimization settings, as well as\nproper evaluation factors, such as the realism of evasion attacks, to gain a\ntrue sense of AT's effectiveness. In our paper, we address this gap by\nsystematically exploring the role such factors have in hardening malware\nclassifiers through AT. Contrary to recent prior work, a key observation of our\nresearch and extensive experiments confirm the hypotheses that all such factors\ninfluence the actual effectiveness of AT, as demonstrated by the varying\ndegrees of success from our empirical analysis. We identify five evaluation\npitfalls that affect state-of-the-art studies and summarize our insights in ten\ntakeaways to draw promising research directions toward better understanding the\nfactors' settings under which adversarial training works at best.\n","authors":["Hamid Bostani","Jacopo Cortellazzi","Daniel Arp","Fabio Pierazzi","Veelasha Moonsamy","Lorenzo Cavallaro"],"pdf_url":"https://arxiv.org/pdf/2412.18218v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.18217v1","updated":"2024-12-24T06:51:21Z","published":"2024-12-24T06:51:21Z","title":"U-Mamba-Net: A highly efficient Mamba-based U-net style network for\n noisy and reverberant speech separation","summary":" The topic of speech separation involves separating mixed speech with multiple\noverlapping speakers into several streams, with each stream containing speech\nfrom only one speaker. Many highly effective models have emerged and\nproliferated rapidly over time. 
However, the size and computational load of\nthese models have also increased accordingly. This is a disaster for the\ncommunity, as researchers need more time and computational resources to\nreproduce and compare existing models. In this paper, we propose U-mamba-net: a\nlightweight Mamba-based U-style model for speech separation in complex\nenvironments. Mamba is a state space sequence model that incorporates feature\nselection capabilities. U-style network is a fully convolutional neural network\nwhose symmetric contracting and expansive paths are able to learn\nmulti-resolution features. In our work, Mamba serves as a feature filter,\nalternating with U-Net. We test the proposed model on Libri2mix. The results\nshow that U-Mamba-Net achieves improved performance with quite low\ncomputational cost.\n","authors":["Shaoxiang Dang","Tetsuya Matsumoto","Yoshinori Takeuchi","Hiroaki Kudo"],"pdf_url":"https://arxiv.org/pdf/2412.18217v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.18212v1","updated":"2024-12-24T06:40:13Z","published":"2024-12-24T06:40:13Z","title":"Accelerating AIGC Services with Latent Action Diffusion Scheduling in\n Edge Networks","summary":" Artificial Intelligence Generated Content (AIGC) has gained significant\npopularity for creating diverse content. Current AIGC models primarily focus on\ncontent quality within a centralized framework, resulting in a high service\ndelay and negative user experiences. However, not only does the workload of an\nAIGC task depend on the AIGC model's complexity rather than the amount of data,\nbut the large model and its multi-layer encoder structure also result in a huge\ndemand for computational and memory resources. These unique characteristics\npose new challenges in its modeling, deployment, and scheduling at edge\nnetworks. 
Thus, we model an offloading problem among edges for providing real\nAIGC services and propose LAD-TS, a novel Latent Action Diffusion-based Task\nScheduling method that orchestrates multiple edge servers for expedited AIGC\nservices. The LAD-TS generates a near-optimal offloading decision by leveraging\nthe diffusion model's conditional generation capability and the reinforcement\nlearning's environment interaction ability, thereby minimizing the service\ndelays under multiple resource constraints. Meanwhile, a latent action\ndiffusion strategy is designed to guide decision generation by utilizing\nhistorical action probability, enabling rapid achievement of near-optimal\ndecisions. Furthermore, we develop DEdgeAI, a prototype edge system with a\nrefined AIGC model deployment to implement and evaluate our LAD-TS method.\nDEdgeAI provides a real AIGC service for users, demonstrating up to 29.18%\nshorter service delays than the current five representative AIGC platforms. We\nrelease our open-source code at https://github.com/ChangfuXu/DEdgeAI/.\n","authors":["Changfu Xu","Jianxiong Guo","Wanyu Lin","Haodong Zou","Wentao Fan","Tian Wang","Xiaowen Chu","Jiannong Cao"],"pdf_url":"https://arxiv.org/pdf/2412.18212v1.pdf","comment":"Under review"},{"id":"http://arxiv.org/abs/2412.18208v1","updated":"2024-12-24T06:28:34Z","published":"2024-12-24T06:28:34Z","title":"Quantum framework for Reinforcement Learning: integrating Markov\n Decision Process, quantum arithmetic, and trajectory search","summary":" This paper introduces a quantum framework for addressing reinforcement\nlearning (RL) tasks, grounded in the quantum principles and leveraging a fully\nquantum model of the classical Markov Decision Process (MDP). By employing\nquantum concepts and a quantum search algorithm, this work presents the\nimplementation and optimization of the agent-environment interactions entirely\nwithin the quantum domain, eliminating reliance on classical computations. 
Key\ncontributions include the quantum-based state transitions, return calculation,\nand trajectory search mechanism that utilize quantum principles to demonstrate\nthe realization of RL processes through quantum phenomena. The implementation\nemphasizes the fundamental role of quantum superposition in enhancing\ncomputational efficiency for RL tasks. Experimental results demonstrate the\ncapacity of a quantum model to achieve quantum advantage in RL, highlighting\nthe potential of fully quantum implementations in decision-making tasks. This\nwork not only underscores the applicability of quantum computing in machine\nlearning but also contributes the field of quantum reinforcement learning (QRL)\nby offering a robust framework for understanding and exploiting quantum\ncomputing in RL systems.\n","authors":["Thet Htar Su","Shaswot Shresthamali","Masaaki Kondo"],"pdf_url":"https://arxiv.org/pdf/2412.18208v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.18207v1","updated":"2024-12-24T06:24:08Z","published":"2024-12-24T06:24:08Z","title":"Sharper Error Bounds in Late Fusion Multi-view Clustering Using\n Eigenvalue Proportion","summary":" Multi-view clustering (MVC) aims to integrate complementary information from\nmultiple views to enhance clustering performance. Late Fusion Multi-View\nClustering (LFMVC) has shown promise by synthesizing diverse clustering results\ninto a unified consensus. However, current LFMVC methods struggle with noisy\nand redundant partitions and often fail to capture high-order correlations\nacross views. To address these limitations, we present a novel theoretical\nframework for analyzing the generalization error bounds of multiple kernel\n$k$-means, leveraging local Rademacher complexity and principal eigenvalue\nproportions. Our analysis establishes a convergence rate of $\\mathcal{O}(1/n)$,\nsignificantly improving upon the existing rate in the order of\n$\\mathcal{O}(\\sqrt{k/n})$. 
Building on this insight, we propose a low-pass\ngraph filtering strategy within a multiple linear $k$-means framework to\nmitigate noise and redundancy, further refining the principal eigenvalue\nproportion and enhancing clustering accuracy. Experimental results on benchmark\ndatasets confirm that our approach outperforms state-of-the-art methods in\nclustering performance and robustness. The related code is available at\nhttps://github.com/csliangdu/GMLKM .\n","authors":["Liang Du","Henghui Jiang","Xiaodong Li","Yiqing Guo","Yan Chen","Feijiang Li","Peng Zhou","Yuhua Qian"],"pdf_url":"https://arxiv.org/pdf/2412.18207v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2403.01875v3","updated":"2024-12-24T06:19:38Z","published":"2024-03-04T09:31:56Z","title":"Locally Convex Global Loss Network for Decision-Focused Learning","summary":" In decision-making problems under uncertainty, predicting unknown parameters\nis often considered independent of the optimization part. Decision-focused\nlearning (DFL) is a task-oriented framework that integrates prediction and\noptimization by adapting the predictive model to give better decisions for the\ncorresponding task. Here, an inevitable challenge arises when computing the\ngradients of the optimal decision with respect to the parameters. Existing\nresearch copes with this issue by smoothly reforming surrogate optimization or\nconstructing surrogate loss functions that mimic task loss. However, they are\napplied to restricted optimization domains. In this paper, we propose Locally\nConvex Global Loss Network (LCGLN), a global surrogate loss model that can be\nimplemented in a general DFL paradigm. LCGLN learns task loss via a partial\ninput convex neural network which is guaranteed to be convex for chosen inputs\nwhile keeping the non-convex global structure for the other inputs. This\nenables LCGLN to admit general DFL through only a single surrogate loss without\nany sense for choosing appropriate parametric forms. 
We confirm the\neffectiveness and flexibility of LCGLN by evaluating our proposed model with\nthree stochastic decision-making problems.\n","authors":["Haeun Jeon","Hyunglip Bae","Minsu Park","Chanyeong Kim","Woo Chang Kim"],"pdf_url":"https://arxiv.org/pdf/2403.01875v3.pdf","comment":"AAAI-25"},{"id":"http://arxiv.org/abs/2309.13536v4","updated":"2024-12-24T06:18:23Z","published":"2023-09-24T03:19:40Z","title":"Tackling Intertwined Data and Device Heterogeneities in Federated\n Learning with Unlimited Staleness","summary":" Federated Learning (FL) can be affected by data and device heterogeneities,\ncaused by clients' different local data distributions and latencies in\nuploading model updates (i.e., staleness). Traditional schemes consider these\nheterogeneities as two separate and independent aspects, but this assumption is\nunrealistic in practical FL scenarios where these heterogeneities are\nintertwined. In these cases, traditional FL schemes are ineffective, and a\nbetter approach is to convert a stale model update into a unstale one. In this\npaper, we present a new FL framework that ensures the accuracy and\ncomputational efficiency of this conversion, hence effectively tackling the\nintertwined heterogeneities that may cause unlimited staleness in model\nupdates. Our basic idea is to estimate the distributions of clients' local\ntraining data from their uploaded stale model updates, and use these\nestimations to compute unstale client model updates. In this way, our approach\ndoes not require any auxiliary dataset nor the clients' local models to be\nfully trained, and does not incur any additional computation or communication\noverhead at client devices. We compared our approach with the existing FL\nstrategies on mainstream datasets and models, and showed that our approach can\nimprove the trained model accuracy by up to 25% and reduce the number of\nrequired training epochs by up to 35%. 
Source codes can be found at:\nhttps://github.com/pittisl/FL-with-intertwined-heterogeneity.\n","authors":["Haoming Wang","Wei Gao"],"pdf_url":"https://arxiv.org/pdf/2309.13536v4.pdf","comment":"22 pages. An abbreviated version is published at AAAI 2025"},{"id":"http://arxiv.org/abs/2408.16087v2","updated":"2024-12-24T06:17:53Z","published":"2024-08-28T18:34:54Z","title":"Unlocking Global Optimality in Bilevel Optimization: A Pilot Study","summary":" Bilevel optimization has witnessed a resurgence of interest, driven by its\ncritical role in trustworthy and efficient AI applications. While many recent\nworks have established convergence to stationary points or local minima,\nobtaining the global optimum of bilevel optimization remains an important yet\nopen problem. The difficulty lies in the fact that, unlike many prior\nnon-convex single-level problems, bilevel problems often do not admit a benign\nlandscape, and may indeed have multiple spurious local solutions. Nevertheless,\nattaining global optimality is indispensable for ensuring reliability, safety,\nand cost-effectiveness, particularly in high-stakes engineering applications\nthat rely on bilevel optimization. In this paper, we first explore the\nchallenges of establishing a global convergence theory for bilevel\noptimization, and present two sufficient conditions for global convergence. We\nprovide algorithm-dependent proofs to rigorously substantiate these sufficient\nconditions on two specific bilevel learning scenarios: representation learning\nand data hypercleaning (a.k.a. reweighting). 
Experiments corroborate the\ntheoretical findings, demonstrating convergence to the global minimum in both\ncases.\n","authors":["Quan Xiao","Tianyi Chen"],"pdf_url":"https://arxiv.org/pdf/2408.16087v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.16220v2","updated":"2024-12-24T06:17:31Z","published":"2024-12-18T10:56:40Z","title":"Cross-Attention Graph Neural Networks for Inferring Gene Regulatory\n Networks with Skewed Degree Distribution","summary":" Inferencing Gene Regulatory Networks (GRNs) from gene expression data is a\npivotal challenge in systems biology, and several innovative computational\nmethods have been introduced. However, most of these studies have not\nconsidered the skewed degree distribution of genes. Specifically, some genes\nmay regulate multiple target genes while some genes may be regulated by\nmultiple regulator genes. Such a skewed degree distribution issue significantly\ncomplicates the application of directed graph embedding methods. To tackle this\nissue, we propose the Cross-Attention Complex Dual Graph Embedding Model\n(XATGRN). Our XATGRN employs a cross-attention mechanism to effectively capture\nintricate gene interactions from gene expression profiles. Additionally, it\nuses a Dual Complex Graph Embedding approach to manage the skewed degree\ndistribution, thereby ensuring precise prediction of regulatory relationships\nand their directionality. Our model consistently outperforms existing\nstate-of-the-art methods across various datasets, underscoring its efficacy in\nelucidating complex gene regulatory mechanisms. 
Our codes used in this paper\nare publicly available at: https://github.com/kikixiong/XATGRN.\n","authors":["Jiaqi Xiong","Nan Yin","Yifan Sun","Haoyang Li","Yingxu Wang","Duo Ai","Fang Pan","Shiyang Liang"],"pdf_url":"https://arxiv.org/pdf/2412.16220v2.pdf","comment":"11 pages, 6 figures,1 tabels"},{"id":"http://arxiv.org/abs/2412.18202v1","updated":"2024-12-24T06:14:34Z","published":"2024-12-24T06:14:34Z","title":"Developing Cryptocurrency Trading Strategy Based on Autoencoder-CNN-GANs\n Algorithms","summary":" This paper leverages machine learning algorithms to forecast and analyze\nfinancial time series. The process begins with a denoising autoencoder to\nfilter out random noise fluctuations from the main contract price data. Then,\none-dimensional convolution reduces the dimensionality of the filtered data and\nextracts key information. The filtered and dimensionality-reduced price data is\nfed into a GANs network, and its output serve as input of a fully connected\nnetwork. Through cross-validation, a model is trained to capture features that\nprecede large price fluctuations. The model predicts the likelihood and\ndirection of significant price changes in real-time price sequences, placing\ntrades at moments of high prediction accuracy. Empirical results demonstrate\nthat using autoencoders and convolution to filter and denoise financial data,\ncombined with GANs, achieves a certain level of predictive performance,\nvalidating the capabilities of machine learning algorithms to discover\nunderlying patterns in financial sequences. 
Keywords - CNN;GANs;\nCryptocurrency; Prediction.\n","authors":["Zhuohuan Hu","Richard Yu","Zizhou Zhang","Haoran Zheng","Qianying Liu","Yining Zhou"],"pdf_url":"https://arxiv.org/pdf/2412.18202v1.pdf","comment":"The paper was accepted by 2024 4th International Conference on\n Artificial Intelligence, Robotics, and Communication(ICAIRC 2024)"},{"id":"http://arxiv.org/abs/2412.18199v1","updated":"2024-12-24T06:09:33Z","published":"2024-12-24T06:09:33Z","title":"Leveraging Deep Learning with Multi-Head Attention for Accurate\n Extraction of Medicine from Handwritten Prescriptions","summary":" Extracting medication names from handwritten doctor prescriptions is\nchallenging due to the wide variability in handwriting styles and prescription\nformats. This paper presents a robust method for extracting medicine names\nusing a combination of Mask R-CNN and Transformer-based Optical Character\nRecognition (TrOCR) with Multi-Head Attention and Positional Embeddings. A\nnovel dataset, featuring diverse handwritten prescriptions from various regions\nof Pakistan, was utilized to fine-tune the model on different handwriting\nstyles. The Mask R-CNN model segments the prescription images to focus on the\nmedicinal sections, while the TrOCR model, enhanced by Multi-Head Attention and\nPositional Embeddings, transcribes the isolated text. The transcribed text is\nthen matched against a pre-existing database for accurate identification. 
The\nproposed approach achieved a character error rate (CER) of 1.4% on standard\nbenchmarks, highlighting its potential as a reliable and efficient tool for\nautomating medicine name extraction.\n","authors":["Usman Ali","Sahil Ranmbail","Muhammad Nadeem","Hamid Ishfaq","Muhammad Umer Ramzan","Waqas Ali"],"pdf_url":"https://arxiv.org/pdf/2412.18199v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.15639v2","updated":"2024-12-24T06:05:22Z","published":"2024-12-20T07:55:59Z","title":"Tacit Learning with Adaptive Information Selection for Cooperative\n Multi-Agent Reinforcement Learning","summary":" In multi-agent reinforcement learning (MARL), the centralized training with\ndecentralized execution (CTDE) framework has gained widespread adoption due to\nits strong performance. However, the further development of CTDE faces two key\nchallenges. First, agents struggle to autonomously assess the relevance of\ninput information for cooperative tasks, impairing their decision-making\nabilities. Second, in communication-limited scenarios with partial\nobservability, agents are unable to access global information, restricting\ntheir ability to collaborate effectively from a global perspective. To address\nthese challenges, we introduce a novel cooperative MARL framework based on\ninformation selection and tacit learning. In this framework, agents gradually\ndevelop implicit coordination during training, enabling them to infer the\ncooperative behavior of others in a discrete space without communication,\nrelying solely on local information. 
Moreover, we integrate gating and\nselection mechanisms, allowing agents to adaptively filter information based on\nenvironmental changes, thereby enhancing their decision-making capabilities.\nExperiments on popular MARL benchmarks show that our framework can be\nseamlessly integrated with state-of-the-art algorithms, leading to significant\nperformance improvements.\n","authors":["Lunjun Liu","Weilai Jiang","Yaonan Wang"],"pdf_url":"https://arxiv.org/pdf/2412.15639v2.pdf","comment":"Accepted by AAMAS 2025 (Extended Abstract)"},{"id":"http://arxiv.org/abs/2412.18196v1","updated":"2024-12-24T06:05:08Z","published":"2024-12-24T06:05:08Z","title":"Robustness-aware Automatic Prompt Optimization","summary":" The performance of Large Language Models (LLMs) is based on the quality of\nthe prompts and the semantic and structural integrity information of the input\ndata. However, current prompt generation methods primarily focus on generating\nprompts for clean input data, often overlooking the impact of perturbed inputs\non prompt performance. To address this limitation, we propose BATprompt (By\nAdversarial Training prompt), a novel method for prompt generation designed to\nwithstand input perturbations (such as typos in the input). Inspired by\nadversarial training techniques, BATprompt demonstrates strong performance on a\nvariety of perturbed tasks through a two-step process: adversarial perturbation\nand iterative optimization on unperturbed input via LLM. Unlike conventional\nadversarial attack methods, BATprompt avoids reliance on real gradients or\nmodel parameters. Instead, it leverages the advanced reasoning, language\nunderstanding and self reflection capabilities of LLMs to simulate gradients,\nguiding the generation of adversarial perturbations and optimizing prompt\nperformance. In our experiments, we evaluate BATprompt on multiple datasets\nacross both language understanding and generation tasks. 
The results indicate\nthat BATprompt outperforms existing prompt generation methods, delivering\nsuperior robustness and performance under diverse perturbation scenarios.\n","authors":["Zeru Shi","Zhenting Wang","Yongye Su","Weidi Luo","Fan Yang","Yongfeng Zhang"],"pdf_url":"https://arxiv.org/pdf/2412.18196v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2210.01708v5","updated":"2024-12-24T05:57:39Z","published":"2022-10-04T16:08:54Z","title":"Exploring Parameter-Efficient Fine-Tuning to Enable Foundation Models in\n Federated Learning","summary":" Federated learning (FL) has emerged as a promising paradigm for enabling the\ncollaborative training of models without centralized access to the raw data on\nlocal devices. In the typical FL paradigm (e.g., FedAvg), model weights are\nsent to and from the server each round to participating clients. Recently, the\nuse of small pre-trained models has been shown to be effective in federated\nlearning optimization and improving convergence. However, recent\nstate-of-the-art pre-trained models are getting more capable but also have more\nparameters, known as the \"Foundation Models.\" In conventional FL, sharing the\nenormous model weights can quickly put a massive communication burden on the\nsystem, especially if more capable models are employed. Can we find a solution\nto enable those strong and readily available pre-trained models in FL to\nachieve excellent performance while simultaneously reducing the communication\nburden? To this end, we investigate the use of parameter-efficient fine-tuning\nin federated learning and thus introduce a new framework: FedPEFT.\nSpecifically, we systemically evaluate the performance of FedPEFT across a\nvariety of client stability, data distribution, and differential privacy\nsettings. 
By only locally tuning and globally sharing a small portion of the\nmodel weights, significant reductions in the total communication overhead can\nbe achieved while maintaining competitive or even better performance in a wide\nrange of federated learning scenarios, providing insight into a new paradigm\nfor practical and effective federated systems.\n","authors":["Guangyu Sun","Umar Khalid","Matias Mendieta","Pu Wang","Chen Chen"],"pdf_url":"https://arxiv.org/pdf/2210.01708v5.pdf","comment":"Published in 2024 IEEE International Conference on Big Data"},{"id":"http://arxiv.org/abs/2405.05075v3","updated":"2024-12-24T05:55:43Z","published":"2024-05-08T14:18:13Z","title":"Sparse-PGD: A Unified Framework for Sparse Adversarial Perturbations\n Generation","summary":" This work studies sparse adversarial perturbations, including both\nunstructured and structured ones. We propose a framework based on a white-box\nPGD-like attack method named Sparse-PGD to effectively and efficiently generate\nsuch perturbations. Furthermore, we combine Sparse-PGD with a black-box attack\nto comprehensively and more reliably evaluate the models' robustness against\nunstructured and structured sparse adversarial perturbations. Moreover, the\nefficiency of Sparse-PGD enables us to conduct adversarial training to build\nrobust models against various sparse perturbations. Extensive experiments\ndemonstrate that our proposed attack algorithm exhibits strong performance in\ndifferent scenarios. More importantly, compared with other robust models, our\nadversarially trained model demonstrates state-of-the-art robustness against\nvarious sparse attacks.\n","authors":["Xuyang Zhong","Chen Liu"],"pdf_url":"https://arxiv.org/pdf/2405.05075v3.pdf","comment":"Extended version. 
Codes are available at\n https://github.com/CityU-MLO/sPGD"},{"id":"http://arxiv.org/abs/2412.16468v2","updated":"2024-12-24T05:54:15Z","published":"2024-12-21T03:51:04Z","title":"The Road to Artificial SuperIntelligence: A Comprehensive Survey of\n Superalignment","summary":" The emergence of large language models (LLMs) has sparked the possibility of\nabout Artificial Superintelligence (ASI), a hypothetical AI system surpassing\nhuman intelligence. However, existing alignment paradigms struggle to guide\nsuch advanced AI systems. Superalignment, the alignment of AI systems with\nhuman values and safety requirements at superhuman levels of capability aims to\naddresses two primary goals -- scalability in supervision to provide\nhigh-quality guidance signals and robust governance to ensure alignment with\nhuman values. In this survey, we examine scalable oversight methods and\npotential solutions for superalignment. Specifically, we explore the concept of\nASI, the challenges it poses, and the limitations of current alignment\nparadigms in addressing the superalignment problem. Then we review scalable\noversight methods for superalignment. Finally, we discuss the key challenges\nand propose pathways for the safe and continual improvement of ASI systems. 
By\ncomprehensively reviewing the current literature, our goal is provide a\nsystematical introduction of existing methods, analyze their strengths and\nlimitations, and discuss potential future directions.\n","authors":["HyunJin Kim","Xiaoyuan Yi","Jing Yao","Jianxun Lian","Muhua Huang","Shitong Duan","JinYeong Bak","Xing Xie"],"pdf_url":"https://arxiv.org/pdf/2412.16468v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.17139v2","updated":"2024-12-24T05:53:24Z","published":"2024-08-30T09:28:32Z","title":"Flow Matching for Optimal Reaction Coordinates of Biomolecular System","summary":" We present flow matching for reaction coordinates (FMRC), a novel deep\nlearning algorithm designed to identify optimal reaction coordinates (RC) in\nbiomolecular reversible dynamics. FMRC is based on the mathematical principles\nof lumpability and decomposability, which we reformulate into a conditional\nprobability framework for efficient data-driven optimization using deep\ngenerative models. While FMRC does not explicitly learn the well-established\ntransfer operator or its eigenfunctions, it can effectively encode the dynamics\nof leading eigenfunctions of the system transfer operator into its\nlow-dimensional RC space. We further quantitatively compare its performance\nwith several state-of-the-art algorithms by evaluating the quality of Markov\nstate models (MSM) constructed in their respective RC spaces, demonstrating the\nsuperiority of FMRC in three increasingly complex biomolecular systems. In\naddition, we successfully demonstrated the efficacy of FMRC for bias deposition\nin the enhanced sampling of a simple model system. 
Finally, we discuss its\npotential applications in downstream applications such as enhanced sampling\nmethods and MSM construction.\n","authors":["Mingyuan Zhang","Zhicheng Zhang","Hao Wu","Yong Wang"],"pdf_url":"https://arxiv.org/pdf/2408.17139v2.pdf","comment":"For Supporting Information, please see\n https://pubs.acs.org/doi/full/10.1021/acs.jctc.4c01139"},{"id":"http://arxiv.org/abs/2412.18187v1","updated":"2024-12-24T05:47:08Z","published":"2024-12-24T05:47:08Z","title":"Learning Sign Language Representation using CNN LSTM, 3DCNN, CNN RNN\n LSTM and CCN TD","summary":" Existing Sign Language Learning applications focus on the demonstration of\nthe sign in the hope that the student will copy a sign correctly. In these\ncases, only a teacher can confirm that the sign was completed correctly, by\nreviewing a video captured manually. Sign Language Translation is a widely\nexplored field in visual recognition. This paper seeks to explore the\nalgorithms that will allow for real-time, video sign translation, and grading\nof sign language accuracy for new sign language users. This required algorithms\ncapable of recognizing and processing spatial and temporal features. The aim of\nthis paper is to evaluate and identify the best neural network algorithm that\ncan facilitate a sign language tuition system of this nature. 
Modern popular\nalgorithms including CNN and 3DCNN are compared on a dataset not yet explored,\nTrinidad and Tobago Sign Language as well as an American Sign Language dataset.\nThe 3DCNN algorithm was found to be the best performing neural network\nalgorithm from these systems with 91% accuracy in the TTSL dataset and 83%\naccuracy in the ASL dataset.\n","authors":["Nikita Louison","Wayne Goodridge","Koffka Khan"],"pdf_url":"https://arxiv.org/pdf/2412.18187v1.pdf","comment":"10 pages"},{"id":"http://arxiv.org/abs/2412.18184v1","updated":"2024-12-24T05:38:01Z","published":"2024-12-24T05:38:01Z","title":"Unified Stochastic Framework for Neural Network Quantization and Pruning","summary":" Quantization and pruning are two essential techniques for compressing neural\nnetworks, yet they are often treated independently, with limited theoretical\nanalysis connecting them. This paper introduces a unified framework for\npost-training quantization and pruning using stochastic path-following\nalgorithms. Our approach builds on the Stochastic Path Following Quantization\n(SPFQ) method, extending its applicability to pruning and low-bit quantization,\nincluding challenging 1-bit regimes. 
By incorporating a scaling parameter and\ngeneralizing the stochastic operator, the proposed method achieves robust error\ncorrection and yields rigorous theoretical error bounds for both quantization\nand pruning as well as their combination.\n","authors":["Haoyu Zhang","Rayan Saab"],"pdf_url":"https://arxiv.org/pdf/2412.18184v1.pdf","comment":"14 pages"},{"id":"http://arxiv.org/abs/2412.18180v1","updated":"2024-12-24T05:34:05Z","published":"2024-12-24T05:34:05Z","title":"PCM Selector: Penalized Covariate-Mediator Selection Operator for\n Evaluating Linear Causal Effects","summary":" For a data-generating process for random variables that can be described with\na linear structural equation model, we consider a situation in which (i) a set\nof covariates satisfying the back-door criterion cannot be observed or (ii)\nsuch a set can be observed, but standard statistical estimation methods cannot\nbe applied to estimate causal effects because of\nmulticollinearity/high-dimensional data problems. We propose a novel two-stage\npenalized regression approach, the penalized covariate-mediator selection\noperator (PCM Selector), to estimate the causal effects in such scenarios.\nUnlike existing penalized regression analyses, when a set of intermediate\nvariables is available, PCM Selector provides a consistent or less biased\nestimator of the causal effect. 
In addition, PCM Selector provides a variable\nselection procedure for intermediate variables to obtain better estimation\naccuracy of the causal effects than does the back-door criterion.\n","authors":["Hisayoshi Nanmo","Manabu Kuroki"],"pdf_url":"https://arxiv.org/pdf/2412.18180v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.18177v1","updated":"2024-12-24T05:25:21Z","published":"2024-12-24T05:25:21Z","title":"Enhancing Online Continual Learning with Plug-and-Play State Space Model\n and Class-Conditional Mixture of Discretization","summary":" Online continual learning (OCL) seeks to learn new tasks from data streams\nthat appear only once, while retaining knowledge of previously learned tasks.\nMost existing methods rely on replay, focusing on enhancing memory retention\nthrough regularization or distillation. However, they often overlook the\nadaptability of the model, limiting the ability to learn generalizable and\ndiscriminative features incrementally from online training data. To address\nthis, we introduce a plug-and-play module, S6MOD, which can be integrated into\nmost existing methods and directly improve adaptability. Specifically, S6MOD\nintroduces an extra branch after the backbone, where a mixture of\ndiscretization selectively adjusts parameters in a selective state space model,\nenriching selective scan patterns such that the model can adaptively select the\nmost sensitive discretization method for current dynamics. We further design a\nclass-conditional routing algorithm for dynamic, uncertainty-based adjustment\nand implement a contrastive discretization loss to optimize it. Extensive\nexperiments combining our module with various models demonstrate that S6MOD\nsignificantly enhances model adaptability, leading to substantial performance\ngains and achieving the state-of-the-art results.\n","authors":["Sihao Liu","Yibo Yang","Xiaojie Li","David A. 
Clifton","Bernard Ghanem"],"pdf_url":"https://arxiv.org/pdf/2412.18177v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2410.20898v2","updated":"2024-12-24T05:22:40Z","published":"2024-10-28T10:26:19Z","title":"Diff-Instruct*: Towards Human-Preferred One-step Text-to-image\n Generative Models","summary":" In this paper, we introduce the Diff-Instruct* (DI*), an image data-free\napproach for building one-step text-to-image generative models that align with\nhuman preference while maintaining the ability to generate highly realistic\nimages. We frame human preference alignment as online reinforcement learning\nusing human feedback (RLHF), where the goal is to maximize the reward function\nwhile regularizing the generator distribution to remain close to a reference\ndiffusion process. Unlike traditional RLHF approaches, which rely on the KL\ndivergence for regularization, we introduce a novel score-based divergence\nregularization, which leads to significantly better performances. Although the\ndirect calculation of this preference alignment objective remains intractable,\nwe demonstrate that we can efficiently compute its gradient by deriving an\nequivalent yet tractable loss function. Remarkably, we used Diff-Instruct* to\ntrain a Stable Diffusion-XL-based 1-step model, the 2.6B DI*-SDXL-1step\ntext-to-image model, which can generate images of a resolution of 1024x1024\nwith only 1 generation step. DI*-SDXL-1step model uses only 1.88% inference\ntime and 29.30% GPU memory cost to outperform 12B FLUX-dev-50step significantly\nin PickScore, ImageReward, and CLIPScore on Parti prompt benchmark and HPSv2.1\non Human Preference Score benchmark, establishing a new state-of-the-art\nbenchmark of human-preferred 1-step text-to-image generative models. Besides\nthe strong quantitative performances, extensive qualitative comparisons also\nconfirm the advantages of DI* in terms of maintaining diversity, improving\nimage layouts, and enhancing aesthetic colors. 
We have released our\nindustry-ready model on the homepage:\n\\url{https://github.com/pkulwj1994/diff_instruct_star}.\n","authors":["Weijian Luo","Colin Zhang","Debing Zhang","Zhengyang Geng"],"pdf_url":"https://arxiv.org/pdf/2410.20898v2.pdf","comment":"revision: 2.6B 1-step text-to-image model outperforms 12B\n Flux-dev-50step model in human preferences"},{"id":"http://arxiv.org/abs/2103.04021v3","updated":"2024-12-24T05:20:59Z","published":"2021-03-06T03:57:46Z","title":"Asymptotic Theory for IV-Based Reinforcement Learning with Potential\n Endogeneity","summary":" In the standard data analysis framework, data is collected (once and for\nall), and then data analysis is carried out. However, with the advancement of\ndigital technology, decision-makers constantly analyze past data and generate\nnew data through their decisions. We model this as a Markov decision process\nand show that the dynamic interaction between data generation and data analysis\nleads to a new type of bias -- reinforcement bias -- that exacerbates the\nendogeneity problem in standard data analysis. We propose a class of instrument\nvariable (IV)-based reinforcement learning (RL) algorithms to correct for the\nbias and establish their theoretical properties by incorporating them into a\nstochastic approximation (SA) framework. Our analysis accommodates\niterate-dependent Markovian structures and, therefore, can be used to study RL\nalgorithms with policy improvement. We also provide formulas for inference on\noptimal policies of the IV-RL algorithms. 
These formulas highlight how\nintertemporal dependencies of the Markovian environment affect the inference.\n","authors":["Jin Li","Ye Luo","Zigan Wang","Xiaowei Zhang"],"pdf_url":"https://arxiv.org/pdf/2103.04021v3.pdf","comment":"main body: 42 pages; supplemental material: 14 pages"},{"id":"http://arxiv.org/abs/2405.11573v2","updated":"2024-12-24T05:16:49Z","published":"2024-05-19T14:42:19Z","title":"Quantile Activation: Correcting a Failure Mode of ML Models","summary":" An established failure mode for machine learning models occurs when the same\nfeatures are equally likely to belong to class 0 and class 1. In such cases,\nexisting ML models cannot correctly classify the sample. However, a solvable\ncase emerges when the probabilities of class 0 and 1 vary with the context\ndistribution. To the best of our knowledge, standard neural network\narchitectures like MLPs or CNNs are not equipped to handle this.\n In this article, we propose a simple activation function, quantile activation\n(QACT), that addresses this problem without significantly increasing\ncomputational costs. The core idea is to adapt the outputs of each neuron to\nits context distribution. The proposed quantile activation, QACT, produces the\nrelative quantile of the sample in its context distribution, rather than the\nactual values, as in traditional networks.\n A practical example where the same sample can have different labels arises in\ncases of inherent distribution shift. We validate the proposed activation\nfunction under such shifts, using datasets designed to test robustness against\ndistortions : CIFAR10C, CIFAR100C, MNISTC, TinyImagenetC. Our results\ndemonstrate significantly better generalization across distortions compared to\nconventional classifiers, across various architectures. 
Although this paper\npresents a proof of concept, we find that this approach unexpectedly\noutperforms DINOv2 (small) under large distortions, despite DINOv2 being\ntrained with a much larger network and dataset.\n","authors":["Aditya Challa","Sravan Danda","Laurent Najman","Snehanshu Saha"],"pdf_url":"https://arxiv.org/pdf/2405.11573v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2410.14919v4","updated":"2024-12-24T05:06:20Z","published":"2024-10-19T00:33:51Z","title":"Adversarial Score identity Distillation: Rapidly Surpassing the Teacher\n in One Step","summary":" Score identity Distillation (SiD) is a data-free method that has achieved\nSOTA performance in image generation by leveraging only a pretrained diffusion\nmodel, without requiring any training data. However, its ultimate performance\nis constrained by how accurate the pretrained model captures the true data\nscores at different stages of the diffusion process. In this paper, we\nintroduce SiDA (SiD with Adversarial Loss), which not only enhances generation\nquality but also improves distillation efficiency by incorporating real images\nand adversarial loss. SiDA utilizes the encoder from the generator's score\nnetwork as a discriminator, allowing it to distinguish between real images and\nthose generated by SiD. The adversarial loss is batch-normalized within each\nGPU and then combined with the original SiD loss. This integration effectively\nincorporates the average \"fakeness\" per GPU batch into the pixel-based SiD\nloss, enabling SiDA to distill a single-step generator. SiDA converges\nsignificantly faster than its predecessor when distilled from scratch, and\nswiftly improves upon the original model's performance during fine-tuning from\na pre-distilled SiD generator. This one-step adversarial distillation method\nestablishes new benchmarks in generation performance when distilling EDM\ndiffusion models, achieving FID scores of 1.110 on ImageNet 64x64. 
When\ndistilling EDM2 models trained on ImageNet 512x512, our SiDA method surpasses\neven the largest teacher model, EDM2-XXL, which achieved an FID of 1.81 using\nclassifier-free guidance (CFG) and 63 generation steps. In contrast, SiDA\nachieves FID scores of 2.156 for size XS, 1.669 for S, 1.488 for M, 1.413 for\nL, 1.379 for XL, and 1.366 for XXL, all without CFG and in a single generation\nstep. These results highlight substantial improvements across all model sizes.\nOur code is available at https://github.com/mingyuanzhou/SiD/tree/sida.\n","authors":["Mingyuan Zhou","Huangjie Zheng","Yi Gu","Zhendong Wang","Hai Huang"],"pdf_url":"https://arxiv.org/pdf/2410.14919v4.pdf","comment":"10 pages (main text), 34 figures, and 10 tables"},{"id":"http://arxiv.org/abs/2412.16098v2","updated":"2024-12-24T05:04:52Z","published":"2024-12-20T17:41:11Z","title":"Explainable AI for Multivariate Time Series Pattern Exploration: Latent\n Space Visual Analytics with Temporal Fusion Transformer and Variational\n Autoencoders in Power Grid Event Diagnosis","summary":" Detecting and analyzing complex patterns in multivariate time-series data is\ncrucial for decision-making in urban and environmental system operations.\nHowever, challenges arise from the high dimensionality, intricate complexity,\nand interconnected nature of complex patterns, which hinder the understanding\nof their underlying physical processes. Existing AI methods often face\nlimitations in interpretability, computational efficiency, and scalability,\nreducing their applicability in real-world scenarios. This paper proposes a\nnovel visual analytics framework that integrates two generative AI models,\nTemporal Fusion Transformer (TFT) and Variational Autoencoders (VAEs), to\nreduce complex patterns into lower-dimensional latent spaces and visualize them\nin 2D using dimensionality reduction techniques such as PCA, t-SNE, and UMAP\nwith DBSCAN. 
These visualizations, presented through coordinated and\ninteractive views and tailored glyphs, enable intuitive exploration of complex\nmultivariate temporal patterns, identifying patterns' similarities and uncover\ntheir potential correlations for a better interpretability of the AI outputs.\nThe framework is demonstrated through a case study on power grid signal data,\nwhere it identifies multi-label grid event signatures, including faults and\nanomalies with diverse root causes. Additionally, novel metrics and\nvisualizations are introduced to validate the models and evaluate the\nperformance, efficiency, and consistency of latent maps generated by TFT and\nVAE under different configurations. These analyses provide actionable insights\nfor model parameter tuning and reliability improvements. Comparative results\nhighlight that TFT achieves shorter run times and superior scalability to\ndiverse time-series data shapes compared to VAE. This work advances fault\ndiagnosis in multivariate time series, fostering explainable AI to support\ncritical system operations.\n","authors":["Haowen Xu","Ali Boyaci","Jianming Lian","Aaron Wilson"],"pdf_url":"https://arxiv.org/pdf/2412.16098v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2402.10142v3","updated":"2024-12-24T04:56:07Z","published":"2024-02-15T17:48:58Z","title":"Tracking Changing Probabilities via Dynamic Learners","summary":" Consider a predictor, a learner, whose input is a stream of discrete items.\nThe predictor's task, at every time point, is probabilistic multiclass\nprediction, i.e. to predict which item may occur next by outputting zero or\nmore candidate items, each with a probability, after which the actual item is\nrevealed and the predictor updates. To output probabilities, the predictor\nkeeps track of the proportions of the items it has seen. The stream is\nunbounded (lifelong), and the predictor has finite limited space. 
The task is\nopen-ended: the set of items is unknown to the predictor and their totality can\nalso grow unbounded. Moreover, there is non-stationarity: the underlying\nfrequencies of items may change, substantially, from time to time. For\ninstance, new items may start appearing and a few recently frequent items may\ncease to occur again. The predictor, being space-bounded, need only provide\nprobabilities for those items which, at the time of prediction, have\nsufficiently high frequency, i.e., the salient items. This problem is motivated\nin the setting of Prediction Games, a self-supervised learning regime where\nconcepts serve as both the predictors and the predictands, and the set of\nconcepts grows over time, resulting in non-stationarities as new concepts are\ngenerated and used. We design and study a number of predictors, sparse moving\naverages(SMAs), for the task. One SMA adapts the sparse exponentiated moving\naverage and another is based on queuing a few counts, keeping dynamic per-item\nhistories. Evaluating the predicted probabilities, under noise and\nnon-stationarity, presents challenges, and we discuss and develop evaluation\nmethods, one based on bounding log-loss. We show that a combination of ideas,\nsupporting dynamic predictand-specific learning rates, offers advantages in\nterms of faster adaption to change (plasticity), while also supporting low\nvariance (stability).\n","authors":["Omid Madani"],"pdf_url":"https://arxiv.org/pdf/2402.10142v3.pdf","comment":"69 pages, 30 figures, 18 tables"},{"id":"http://arxiv.org/abs/2412.18164v1","updated":"2024-12-24T04:55:46Z","published":"2024-12-24T04:55:46Z","title":"Stochastic Control for Fine-tuning Diffusion Models: Optimality,\n Regularity, and Convergence","summary":" Diffusion models have emerged as powerful tools for generative modeling,\ndemonstrating exceptional capability in capturing target data distributions\nfrom large datasets. 
However, fine-tuning these massive models for specific\ndownstream tasks, constraints, and human preferences remains a critical\nchallenge. While recent advances have leveraged reinforcement learning\nalgorithms to tackle this problem, much of the progress has been empirical,\nwith limited theoretical understanding. To bridge this gap, we propose a\nstochastic control framework for fine-tuning diffusion models. Building on\ndenoising diffusion probabilistic models as the pre-trained reference dynamics,\nour approach integrates linear dynamics control with Kullback-Leibler\nregularization. We establish the well-posedness and regularity of the\nstochastic control problem and develop a policy iteration algorithm (PI-FT) for\nnumerical solution. We show that PI-FT achieves global convergence at a linear\nrate. Unlike existing work that assumes regularities throughout training, we\nprove that the control and value sequences generated by the algorithm maintain\nthe regularity. Additionally, we explore extensions of our framework to\nparametric settings and continuous-time formulations.\n","authors":["Yinbin Han","Meisam Razaviyayn","Renyuan Xu"],"pdf_url":"https://arxiv.org/pdf/2412.18164v1.pdf","comment":"28 pages"},{"id":"http://arxiv.org/abs/2412.15703v3","updated":"2024-12-24T04:42:00Z","published":"2024-12-20T09:26:41Z","title":"MacLight: Multi-scene Aggregation Convolutional Learning for Traffic\n Signal Control","summary":" Reinforcement learning methods have proposed promising traffic signal control\npolicy that can be trained on large road networks. Current SOTA methods model\nroad networks as topological graph structures, incorporate graph attention into\ndeep Q-learning, and merge local and global embeddings to improve policy.\nHowever, graph-based methods are difficult to parallelize, resulting in huge\ntime overhead. 
Moreover, none of the current peer studies have deployed dynamic\ntraffic systems for experiments, which is far from the actual situation.\n In this context, we propose Multi-Scene Aggregation Convolutional Learning\nfor traffic signal control (MacLight), which offers faster training speeds and\nmore stable performance. Our approach consists of two main components. The\nfirst is the global representation, where we utilize variational autoencoders\nto compactly compress and extract the global representation. The second\ncomponent employs the proximal policy optimization algorithm as the backbone,\nallowing value evaluation to consider both local features and global embedding\nrepresentations. This backbone model significantly reduces time overhead and\nensures stability in policy updates. We validated our method across multiple\ntraffic scenarios under both static and dynamic traffic systems. Experimental\nresults demonstrate that, compared to general and domain SOTA methods, our\napproach achieves superior stability, optimized convergence levels and the\nhighest time efficiency. The code is under\nhttps://github.com/Aegis1863/MacLight.\n","authors":["Sunbowen Lee","Hongqin Lyu","Yicheng Gong","Yingying Sun","Chao Deng"],"pdf_url":"https://arxiv.org/pdf/2412.15703v3.pdf","comment":"Accepted as full paper by AAMAS2025"},{"id":"http://arxiv.org/abs/2211.00181v4","updated":"2024-12-24T04:28:34Z","published":"2022-10-31T22:51:59Z","title":"The Numerical Stability of Hyperbolic Representation Learning","summary":" Given the exponential growth of the volume of the ball w.r.t. its radius, the\nhyperbolic space is capable of embedding trees with arbitrarily small\ndistortion and hence has received wide attention for representing hierarchical\ndatasets. 
However, this exponential growth property comes at a price of\nnumerical instability such that training hyperbolic learning models will\nsometimes lead to catastrophic NaN problems, encountering unrepresentable\nvalues in floating point arithmetic. In this work, we carefully analyze the\nlimitation of two popular models for the hyperbolic space, namely, the\nPoincar\\'e ball and the Lorentz model. We first show that, under the 64 bit\narithmetic system, the Poincar\\'e ball has a relatively larger capacity than\nthe Lorentz model for correctly representing points. Then, we theoretically\nvalidate the superiority of the Lorentz model over the Poincar\\'e ball from the\nperspective of optimization. Given the numerical limitations of both models, we\nidentify one Euclidean parametrization of the hyperbolic space which can\nalleviate these limitations. We further extend this Euclidean parametrization\nto hyperbolic hyperplanes and exhibits its ability in improving the performance\nof hyperbolic SVM.\n","authors":["Gal Mishne","Zhengchao Wan","Yusu Wang","Sheng Yang"],"pdf_url":"https://arxiv.org/pdf/2211.00181v4.pdf","comment":"update funding info"},{"id":"http://arxiv.org/abs/2412.17153v2","updated":"2024-12-24T04:21:15Z","published":"2024-12-22T20:21:54Z","title":"Distilled Decoding 1: One-step Sampling of Image Auto-regressive Models\n with Flow Matching","summary":" Autoregressive (AR) models have achieved state-of-the-art performance in text\nand image generation but suffer from slow generation due to the token-by-token\nprocess. We ask an ambitious question: can a pre-trained AR model be adapted to\ngenerate outputs in just one or two steps? If successful, this would\nsignificantly advance the development and deployment of AR models. 
We notice\nthat existing works that try to speed up AR generation by generating multiple\ntokens at once fundamentally cannot capture the output distribution due to the\nconditional dependencies between tokens, limiting their effectiveness for\nfew-step generation. To address this, we propose Distilled Decoding (DD), which\nuses flow matching to create a deterministic mapping from Gaussian distribution\nto the output distribution of the pre-trained AR model. We then train a network\nto distill this mapping, enabling few-step generation. DD doesn't need the\ntraining data of the original AR model, making it more practical. We evaluate\nDD on state-of-the-art image AR models and present promising results on\nImageNet-256. For VAR, which requires 10-step generation, DD enables one-step\ngeneration (6.3$\\times$ speed-up), with an acceptable increase in FID from 4.19\nto 9.96. For LlamaGen, DD reduces generation from 256 steps to 1, achieving an\n217.8$\\times$ speed-up with a comparable FID increase from 4.11 to 11.35. In\nboth cases, baseline methods completely fail with FID>100. DD also excels on\ntext-to-image generation, reducing the generation from 256 steps to 2 for\nLlamaGen with minimal FID increase from 25.70 to 28.95. As the first work to\ndemonstrate the possibility of one-step generation for image AR models, DD\nchallenges the prevailing notion that AR models are inherently slow, and opens\nup new opportunities for efficient AR generation. 
The project website is at\nhttps://imagination-research.github.io/distilled-decoding.\n","authors":["Enshu Liu","Xuefei Ning","Yu Wang","Zinan Lin"],"pdf_url":"https://arxiv.org/pdf/2412.17153v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.14074v3","updated":"2024-12-24T04:05:27Z","published":"2024-01-25T10:52:36Z","title":"ProCNS: Progressive Prototype Calibration and Noise Suppression for\n Weakly-Supervised Medical Image Segmentation","summary":" Weakly-supervised segmentation (WSS) has emerged as a solution to mitigate\nthe conflict between annotation cost and model performance by adopting sparse\nannotation formats (e.g., point, scribble, block, etc.). Typical approaches\nattempt to exploit anatomy and topology priors to directly expand sparse\nannotations into pseudo-labels. However, due to a lack of attention to the\nambiguous edges in medical images and insufficient exploration of sparse\nsupervision, existing approaches tend to generate erroneous and overconfident\npseudo proposals in noisy regions, leading to cumulative model error and\nperformance degradation. In this work, we propose a novel WSS approach, named\nProCNS, encompassing two synergistic modules devised with the principles of\nprogressive prototype calibration and noise suppression. Specifically, we\ndesign a Prototype-based Regional Spatial Affinity (PRSA) loss to maximize the\npair-wise affinities between spatial and semantic elements, providing our model\nof interest with more reliable guidance. The affinities are derived from the\ninput images and the prototype-refined predictions. Meanwhile, we propose an\nAdaptive Noise Perception and Masking (ANPM) module to obtain more enriched and\nrepresentative prototype representations, which adaptively identifies and masks\nnoisy regions within the pseudo proposals, reducing potential erroneous\ninterference during prototype computation. 
Furthermore, we generate specialized\nsoft pseudo-labels for the noisy regions identified by ANPM, providing\nsupplementary supervision. Extensive experiments on six medical image\nsegmentation tasks involving different modalities demonstrate that the proposed\nframework significantly outperforms representative state-of-the-art methods.\n","authors":["Y. Liu","L. Lin","K. K. Y. Wong","X. Tang"],"pdf_url":"https://arxiv.org/pdf/2401.14074v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.18144v1","updated":"2024-12-24T03:56:25Z","published":"2024-12-24T03:56:25Z","title":"Neural Conformal Control for Time Series Forecasting","summary":" We introduce a neural network conformal prediction method for time series\nthat enhances adaptivity in non-stationary environments. Our approach acts as a\nneural controller designed to achieve desired target coverage, leveraging\nauxiliary multi-view data with neural network encoders in an end-to-end manner\nto further enhance adaptivity. Additionally, our model is designed to enhance\nthe consistency of prediction intervals in different quantiles by integrating\nmonotonicity constraints and leverages data from related tasks to boost\nfew-shot learning performance. Using real-world datasets from epidemics,\nelectric demand, weather, and others, we empirically demonstrate significant\nimprovements in coverage and probabilistic accuracy, and find that our method\nis the only one that combines good calibration with consistency in prediction\nintervals.\n","authors":["Ruipu Li","Alexander Rodríguez"],"pdf_url":"https://arxiv.org/pdf/2412.18144v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.18140v1","updated":"2024-12-24T03:53:57Z","published":"2024-12-24T03:53:57Z","title":"An Instrumental Value for Data Production and its Application to Data\n Pricing","summary":" How much value does a dataset or a data production process have to an agent\nwho wishes to use the data to assist decision-making? 
This is a fundamental\nquestion towards understanding the value of data as well as further pricing of\ndata. This paper develops an approach for capturing the instrumental value of\ndata production processes, which takes two key factors into account: (a) the\ncontext of the agent's decision-making problem; (b) prior data or information\nthe agent already possesses. We ''micro-found'' our valuation concepts by\nshowing how they connect to classic notions of information design and signals\nin information economics. When instantiated in the domain of Bayesian linear\nregression, our value naturally corresponds to information gain. Based on our\ndesigned data value, we then study a basic monopoly pricing setting with a\nbuyer looking to purchase from a seller some labeled data of a certain feature\ndirection in order to improve a Bayesian regression model. We show that when\nthe seller has the ability to fully customize any data request, she can extract\nthe first-best revenue (i.e., full surplus) from any population of buyers,\ni.e., achieving first-degree price discrimination. If the seller can only sell\ndata that are derived from an existing data pool, this limits her ability to\ncustomize, and achieving first-best revenue becomes generally impossible.\nHowever, we design a mechanism that achieves seller revenue at most $\\log\n(\\kappa)$ less than the first-best revenue, where $\\kappa$ is the condition\nnumber associated with the data matrix. 
A corollary of this result is that the\nseller can extract the first-best revenue in the multi-armed bandits special\ncase.\n","authors":["Rui Ai","Boxiang Lyu","Zhaoran Wang","Zhuoran Yang","Haifeng Xu"],"pdf_url":"https://arxiv.org/pdf/2412.18140v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.18138v1","updated":"2024-12-24T03:49:48Z","published":"2024-12-24T03:49:48Z","title":"Fundamental Limits in the Search for Less Discriminatory Algorithms --\n and How to Avoid Them","summary":" Disparate impact doctrine offers an important legal apparatus for targeting\nunfair data-driven algorithmic decisions. A recent body of work has focused on\nconceptualizing and operationalizing one particular construct from this\ndoctrine -- the less discriminatory alternative, an alternative policy that\nreduces disparities while meeting the same business needs of a status quo or\nbaseline policy. This paper puts forward four fundamental results, which each\nrepresent limits to searching for and using less discriminatory algorithms\n(LDAs). (1) Statistically, although LDAs are almost always identifiable in\nretrospect on fixed populations, making conclusions about how alternative\nclassifiers perform on an unobserved distribution is more difficult. (2)\nMathematically, a classifier can only exhibit certain combinations of accuracy\nand selection rate disparity between groups, given the size of each group and\nthe base rate of the property or outcome of interest in each group. (3)\nComputationally, a search for a lower-disparity classifier at some baseline\nlevel of utility is NP-hard. (4) From a modeling and consumer welfare\nperspective, defining an LDA only in terms of business needs can lead to LDAs\nthat leave consumers strictly worse off, including members of the disadvantaged\ngroup. These findings, which may seem on their face to give firms strong\ndefenses against discrimination claims, only tell part of the story. 
For each\nof our negative results limiting what is attainable in this setting, we offer\npositive results demonstrating that there exist effective and low-cost\nstrategies that are remarkably effective at identifying viable lower-disparity\npolicies.\n","authors":["Benjamin Laufer","Manisch Raghavan","Solon Barocas"],"pdf_url":"https://arxiv.org/pdf/2412.18138v1.pdf","comment":"23 pages, 4 figures, 1 table. Prior versions appeared at NeurIPS\n Algorithmic Fairness Through the Lens of Metrics and Evaluation Workshop\n (AFME 2024) and Regulatable ML Workshop (RegML 2024). Forthcoming at ACM\n CS&Law 2025"},{"id":"http://arxiv.org/abs/2409.03005v2","updated":"2024-12-24T03:49:18Z","published":"2024-09-04T18:01:10Z","title":"PIETRA: Physics-Informed Evidential Learning for Traversing\n Out-of-Distribution Terrain","summary":" Self-supervised learning is a powerful approach for developing traversability\nmodels for off-road navigation, but these models often struggle with inputs\nunseen during training. Existing methods utilize techniques like evidential\ndeep learning to quantify model uncertainty, helping to identify and avoid\nout-of-distribution terrain. However, always avoiding out-of-distribution\nterrain can be overly conservative, e.g., when novel terrain can be effectively\nanalyzed using a physics-based model. To overcome this challenge, we introduce\nPhysics-Informed Evidential Traversability (PIETRA), a self-supervised learning\nframework that integrates physics priors directly into the mathematical\nformulation of evidential neural networks and introduces physics knowledge\nimplicitly through an uncertainty-aware, physics-informed training loss. Our\nevidential network seamlessly transitions between learned and physics-based\npredictions for out-of-distribution inputs. Additionally, the physics-informed\nloss regularizes the learned model, ensuring better alignment with the physics\nmodel. 
Extensive simulations and hardware experiments demonstrate that PIETRA\nimproves both learning accuracy and navigation performance in environments with\nsignificant distribution shifts.\n","authors":["Xiaoyi Cai","James Queeney","Tong Xu","Aniket Datar","Chenhui Pan","Max Miller","Ashton Flather","Philip R. Osteen","Nicholas Roy","Xuesu Xiao","Jonathan P. How"],"pdf_url":"https://arxiv.org/pdf/2409.03005v2.pdf","comment":"To appear in RA-L. Video: https://youtu.be/OTnNZ96oJRk"},{"id":"http://arxiv.org/abs/2412.13231v3","updated":"2024-12-24T03:46:32Z","published":"2024-12-17T13:42:49Z","title":"C2F-TP: A Coarse-to-Fine Denoising Framework for Uncertainty-Aware\n Trajectory Prediction","summary":" Accurately predicting the trajectory of vehicles is critically important for\nensuring safety and reliability in autonomous driving. Although considerable\nresearch efforts have been made recently, the inherent trajectory uncertainty\ncaused by various factors including the dynamic driving intends and the diverse\ndriving scenarios still poses significant challenges to accurate trajectory\nprediction. To address this issue, we propose C2F-TP, a coarse-to-fine\ndenoising framework for uncertainty-aware vehicle trajectory prediction. C2F-TP\nfeatures an innovative two-stage coarse-to-fine prediction process.\nSpecifically, in the spatial-temporal interaction stage, we propose a\nspatial-temporal interaction module to capture the inter-vehicle interactions\nand learn a multimodal trajectory distribution, from which a certain number of\nnoisy trajectories are sampled. Next, in the trajectory refinement stage, we\ndesign a conditional denoising model to reduce the uncertainty of the sampled\ntrajectories through a step-wise denoising operation. Extensive experiments are\nconducted on two real datasets NGSIM and highD that are widely adopted in\ntrajectory prediction. 
The result demonstrates the effectiveness of our\nproposal.\n","authors":["Zichen Wang","Hao Miao","Senzhang Wang","Renzhi Wang","Jianxin Wang","Jian Zhang"],"pdf_url":"https://arxiv.org/pdf/2412.13231v3.pdf","comment":"Accepted by AAAI 2025"},{"id":"http://arxiv.org/abs/2412.15295v2","updated":"2024-12-24T03:43:45Z","published":"2024-12-19T09:03:39Z","title":"Log-Time K-Means Clustering for 1D Data: Novel Approaches with Proof and\n Implementation","summary":" Clustering is a key task in machine learning, with $k$-means being widely\nused for its simplicity and effectiveness. While 1D clustering is common,\nexisting methods often fail to exploit the structure of 1D data, leading to\ninefficiencies. This thesis introduces optimized algorithms for $k$-means++\ninitialization and Lloyd's algorithm, leveraging sorted data, prefix sums, and\nbinary search for improved computational performance. The main contributions\nare: (1) an optimized $k$-cluster algorithm achieving $O(l \\cdot k^2 \\cdot \\log\nn)$ complexity for greedy $k$-means++ initialization and $O(i \\cdot k \\cdot\n\\log n)$ for Lloyd's algorithm, where $l$ is the number of greedy $k$-means++\nlocal trials, and $i$ is the number of Lloyd's algorithm iterations, and (2) a\nbinary search-based two-cluster algorithm, achieving $O(\\log n)$ runtime with\ndeterministic convergence to a Lloyd's algorithm local minimum. Benchmarks\ndemonstrate over a 4500x speedup compared to scikit-learn for large datasets\nwhile maintaining clustering quality measured by within-cluster sum of squares\n(WCSS). Additionally, the algorithms achieve a 300x speedup in an LLM\nquantization task, highlighting their utility in emerging applications. 
This\nthesis bridges theory and practice for 1D $k$-means clustering, delivering\nefficient and sound algorithms implemented in a JIT-optimized open-source\nPython library.\n","authors":["Jake Hyun"],"pdf_url":"https://arxiv.org/pdf/2412.15295v2.pdf","comment":"Undergraduate thesis, Department of Computer Science and Engineering,\n Seoul National University. Minor revisions incorporated post-submission"},{"id":"http://arxiv.org/abs/2412.18134v1","updated":"2024-12-24T03:42:53Z","published":"2024-12-24T03:42:53Z","title":"Learning Randomized Reductions and Program Properties","summary":" The correctness of computations remains a significant challenge in computer\nscience, with traditional approaches relying on automated testing or formal\nverification. Self-testing/correcting programs introduce an alternative\nparadigm, allowing a program to verify and correct its own outputs via\nrandomized reductions, a concept that previously required manual derivation. In\nthis paper, we present Bitween, a method and tool for automated learning of\nrandomized (self)-reductions and program properties in numerical programs.\nBitween combines symbolic analysis and machine learning, with a surprising\nfinding: polynomial-time linear regression, a basic optimization method, is not\nonly sufficient but also highly effective for deriving complex randomized\nself-reductions and program invariants, often outperforming sophisticated\nmixed-integer linear programming solvers. We establish a theoretical framework\nfor learning these reductions and introduce RSR-Bench, a benchmark suite for\nevaluating Bitween's capabilities on scientific and machine learning functions.\nOur empirical results show that Bitween surpasses state-of-the-art tools in\nscalability, stability, and sample efficiency when evaluated on nonlinear\ninvariant benchmarks like NLA-DigBench. 
Bitween is open-source as a Python\npackage and accessible via a web interface that supports C language programs.\n","authors":["Ferhat Erata","Orr Paradise","Timos Antonopoulos","ThanhVu Nguyen","Shafi Goldwasser","Ruzica Piskac"],"pdf_url":"https://arxiv.org/pdf/2412.18134v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2403.15654v3","updated":"2024-12-24T03:37:34Z","published":"2024-03-23T00:01:34Z","title":"The Effectiveness of Local Updates for Decentralized Learning under Data\n Heterogeneity","summary":" We revisit two fundamental decentralized optimization methods, Decentralized\nGradient Tracking (DGT) and Decentralized Gradient Descent (DGD), with multiple\nlocal updates. We consider two settings and demonstrate that incorporating\nlocal update steps can reduce communication complexity. Specifically, for\n$\\mu$-strongly convex and $L$-smooth loss functions, we proved that local DGT\nachieves communication complexity {}{$\\tilde{\\mathcal{O}}\n\\Big(\\frac{L}{\\mu(K+1)} + \\frac{\\delta + {}{\\mu}}{\\mu (1 - \\rho)} + \\frac{\\rho\n}{(1 - \\rho)^2} \\cdot \\frac{L+ \\delta}{\\mu}\\Big)$}, %\\zhize{seems to be\n$\\tilde{\\mathcal{O}}$} {where $K$ is the number of additional local update},\n$\\rho$ measures the network connectivity and $\\delta$ measures the second-order\nheterogeneity of the local losses. Our results reveal the tradeoff between\ncommunication and computation and show increasing $K$ can effectively reduce\ncommunication costs when the data heterogeneity is low and the network is\nwell-connected. We then consider the over-parameterization regime where the\nlocal losses share the same minimums. We proved that employing local updates in\nDGD, even without gradient correction, achieves exact linear convergence under\nthe Polyak-{\\L}ojasiewicz (PL) condition, which can yield a similar effect as\nDGT in reducing communication complexity. {}{Customization of the result to\nlinear models is further provided, with improved rate expression. 
}Numerical\nexperiments validate our theoretical results.\n","authors":["Tongle Wu","Zhize Li","Ying Sun"],"pdf_url":"https://arxiv.org/pdf/2403.15654v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.00852v2","updated":"2024-12-24T03:24:55Z","published":"2024-10-30T11:22:37Z","title":"EF-LLM: Energy Forecasting LLM with AI-assisted Automation, Enhanced\n Sparse Prediction, Hallucination Detection","summary":" Accurate prediction helps to achieve supply-demand balance in energy systems,\nsupporting decision-making and scheduling. Traditional models, lacking\nAI-assisted automation, rely on experts, incur high costs, and struggle with\nsparse data prediction. To address these challenges, we propose the Energy\nForecasting Large Language Model (EF-LLM), which integrates domain knowledge\nand temporal data for time-series forecasting, supporting both pre-forecast\noperations and post-forecast decision-support. EF-LLM's human-AI interaction\ncapabilities lower the entry barrier in forecasting tasks, reducing the need\nfor extra expert involvement. To achieve this, we propose a continual learning\napproach with updatable LoRA and a multi-channel architecture for aligning\nheterogeneous multimodal data, enabling EF-LLM to continually learn\nheterogeneous multimodal knowledge. In addition, EF-LLM enables accurate\npredictions under sparse data conditions through its ability to process\nmultimodal data. We propose Fusion Parameter-Efficient Fine-Tuning (F-PEFT)\nmethod to effectively leverage both time-series data and text for this purpose.\nEF-LLM is also the first energy-specific LLM to detect hallucinations and\nquantify their occurrence rate, achieved via multi-task learning, semantic\nsimilarity analysis, and ANOVA. 
We have achieved success in energy prediction\nscenarios for load, photovoltaic, and wind power forecast.\n","authors":["Zihang Qiu","Chaojie Li","Zhongyang Wang","Renyou Xie","Borui Zhang","Huadong Mo","Guo Chen","Zhaoyang Dong"],"pdf_url":"https://arxiv.org/pdf/2411.00852v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.18119v1","updated":"2024-12-24T03:06:22Z","published":"2024-12-24T03:06:22Z","title":"Age Optimal Sampling for Unreliable Channels under Unknown Channel\n Statistics","summary":" In this paper, we study a system in which a sensor forwards status updates to\na receiver through an error-prone channel, while the receiver sends the\ntransmission results back to the sensor via a reliable channel. Both channels\nare subject to random delays. To evaluate the timeliness of the status\ninformation at the receiver, we use the Age of Information (AoI) metric. The\nobjective is to design a sampling policy that minimizes the expected\ntime-average AoI, even when the channel statistics (e.g., delay distributions)\nare unknown. We first review the threshold structure of the optimal offline\npolicy under known channel statistics and then reformulate the design of the\nonline algorithm as a stochastic approximation problem. We propose a\nRobbins-Monro algorithm to solve this problem and demonstrate that the optimal\nthreshold can be approximated almost surely. Moreover, we prove that the\ncumulative AoI regret of the online algorithm increases with rate\n$\\mathcal{O}(\\ln K)$, where $K$ is the number of successful transmissions. In\naddition, our algorithm is shown to be minimax order optimal, in the sense that\nfor any online learning algorithm, the cumulative AoI regret up to the $K$-th\nsuccessful transmissions grows with the rate at least $\\Omega(\\ln K)$ in the\nworst case delay distribution. Finally, we improve the stability of the\nproposed online learning algorithm through a momentum-based stochastic gradient\ndescent algorithm. 
Simulation results validate the performance of our proposed\nalgorithm.\n","authors":["Hongyi He","Haoyue Tang","Jiayu Pan","Jintao Wang","Jian Song","Leandros Tassiulas"],"pdf_url":"https://arxiv.org/pdf/2412.18119v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.10958v3","updated":"2024-12-24T02:50:14Z","published":"2024-11-17T04:35:49Z","title":"SageAttention2: Efficient Attention with Thorough Outlier Smoothing and\n Per-thread INT4 Quantization","summary":" Although quantization for linear layers has been widely used, its application\nto accelerate the attention process remains limited. To further enhance the\nefficiency of attention computation compared to SageAttention while maintaining\nprecision, we propose SageAttention2, which utilizes significantly faster 4-bit\nmatrix multiplication (Matmul) alongside additional precision-enhancing\ntechniques. First, we propose to quantize matrixes $(Q, K)$ to INT4 in a\nhardware-friendly thread-level granularity and quantize matrixes $(\\widetilde\nP, V)$ to FP8. Second, we propose a method to smooth $Q$, enhancing the\naccuracy of INT4 $QK$. Third, we propose to use an FP32 Matmul buffer for $PV$\nto enhance the accuracy of FP8 $\\widetilde PV$. The operations per second (OPS)\nof SageAttention2 surpass FlashAttention2 and xformers by about 3x and 5x on\nRTX4090, respectively. 
Comprehensive experiments confirm that our approach\nincurs negligible end-to-end metrics loss across diverse models, including\nthose for large language processing, image generation, and video generation.\nThe codes are available at https://github.com/thu-ml/SageAttention.\n","authors":["Jintao Zhang","Haofeng Huang","Pengle Zhang","Jia Wei","Jun Zhu","Jianfei Chen"],"pdf_url":"https://arxiv.org/pdf/2411.10958v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2410.02367v3","updated":"2024-12-24T02:29:17Z","published":"2024-10-03T10:25:23Z","title":"SageAttention: Accurate 8-Bit Attention for Plug-and-play Inference\n Acceleration","summary":" The transformer architecture predominates across various models. As the heart\nof the transformer, attention has a computational complexity of O(N^2),\ncompared to O(N) for linear transformations. When handling large sequence\nlengths, attention becomes the primary time-consuming component. Although\nquantization has proven to be an effective method for accelerating model\ninference, existing quantization methods primarily focus on optimizing the\nlinear layer. In response, we first analyze the feasibility of quantization in\nattention detailedly. Following that, we propose SageAttention, a highly\nefficient and accurate quantization method for attention. The OPS (operations\nper second) of our approach outperforms FlashAttention2 and xformers by about\n2.1 times and 2.7 times, respectively. SageAttention also achieves superior\naccuracy performance over FlashAttention3. Comprehensive experiments confirm\nthat our approach incurs almost no end-to-end metrics loss across diverse\nmodels, including those for large language processing, image generation, and\nvideo generation. 
The codes are available at\nhttps://github.com/thu-ml/SageAttention.\n","authors":["Jintao Zhang","Jia wei","Haofeng Huang","Pengle Zhang","Jun Zhu","Jianfei Chen"],"pdf_url":"https://arxiv.org/pdf/2410.02367v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.10748v2","updated":"2024-12-24T02:27:53Z","published":"2024-12-14T08:31:56Z","title":"A Pioneering Neural Network Method for Efficient and Robust Fuel\n Sloshing Simulation in Aircraft","summary":" Simulating fuel sloshing within aircraft tanks during flight is crucial for\naircraft safety research. Traditional methods based on Navier-Stokes equations\nare computationally expensive. In this paper, we treat fluid motion as point\ncloud transformation and propose the first neural network method specifically\ndesigned for simulating fuel sloshing in aircraft. This model is also the deep\nlearning model that is the first to be capable of stably modeling fluid\nparticle dynamics in such complex scenarios. Our triangle feature fusion design\nachieves an optimal balance among fluid dynamics modeling, momentum\nconservation constraints, and global stability control. Additionally, we\nconstructed the Fueltank dataset, the first dataset for aircraft fuel surface\nsloshing. It comprises 320,000 frames across four typical tank types and covers\na wide range of flight maneuvers, including multi-directional rotations. We\nconducted comprehensive experiments on both our dataset and the take-off\nscenario of the aircraft. Compared to existing neural network-based fluid\nsimulation algorithms, we significantly enhanced accuracy while maintaining\nhigh computational speed. Compared to traditional SPH methods, our speed\nimproved approximately 10 times. 
Furthermore, compared to traditional fluid\nsimulation software such as Flow3D, our computation speed increased by more\nthan 300 times.\n","authors":["Yu Chen","Shuai Zheng","Nianyi Wang","Menglong Jin","Yan Chang"],"pdf_url":"https://arxiv.org/pdf/2412.10748v2.pdf","comment":"This paper has been accepted by AAAI Conference on Artificial\n Intelligence (AAAI-25)"},{"id":"http://arxiv.org/abs/2402.02431v2","updated":"2024-12-24T02:20:02Z","published":"2024-02-04T10:00:00Z","title":"Learning Mutual Excitation for Hand-to-Hand and Human-to-Human\n Interaction Recognition","summary":" Recognizing interactive actions, including hand-to-hand interaction and\nhuman-to-human interaction, has attracted increasing attention for various\napplications in the field of video analysis and human-robot interaction.\nConsidering the success of graph convolution in modeling topology-aware\nfeatures from skeleton data, recent methods commonly operate graph convolution\non separate entities and use late fusion for interactive action recognition,\nwhich can barely model the mutual semantic relationships between pairwise\nentities. To this end, we propose a mutual excitation graph convolutional\nnetwork (me-GCN) by stacking mutual excitation graph convolution (me-GC)\nlayers. Specifically, me-GC uses a mutual topology excitation module to firstly\nextract adjacency matrices from individual entities and then adaptively model\nthe mutual constraints between them. Moreover, me-GC extends the above idea and\nfurther uses a mutual feature excitation module to extract and merge deep\nfeatures from pairwise entities. Compared with graph convolution, our proposed\nme-GC gradually learns mutual information in each layer and each stage of graph\nconvolution operations. 
Extensive experiments on a challenging hand-to-hand\ninteraction dataset, i.e., the Assembly101 dataset, and two large-scale\nhuman-to-human interaction datasets, i.e., NTU60-Interaction and\nNTU120-Interaction consistently verify the superiority of our proposed method,\nwhich outperforms the state-of-the-art GCN-based and Transformer-based methods.\n","authors":["Mengyuan Liu","Chen Chen","Songtao Wu","Fanyang Meng","Hong Liu"],"pdf_url":"https://arxiv.org/pdf/2402.02431v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.16830v2","updated":"2024-12-24T02:19:59Z","published":"2024-12-22T02:36:09Z","title":"Algorithm Design for Continual Learning in IoT Networks","summary":" Continual learning (CL) is a new online learning technique over sequentially\ngenerated streaming data from different tasks, aiming to maintain a small\nforgetting loss on previously-learned tasks. Existing work focuses on reducing\nthe forgetting loss under a given task sequence. However, if similar tasks\ncontinuously appear to the end time, the forgetting loss is still huge on prior\ndistinct tasks. In practical IoT networks, an autonomous vehicle to sample data\nand learn different tasks can route and alter the order of task pattern at\nincreased travelling cost. To our best knowledge, we are the first to study how\nto opportunistically route the testing object and alter the task sequence in\nCL. We formulate a new optimization problem and prove it NP-hard. We propose a\npolynomial-time algorithm to achieve approximation ratios of $\\frac{3}{2}$ for\nunderparameterized case and $\\frac{3}{2} + r^{1-T}$ for overparameterized case,\nrespectively, where $r:=1-\\frac{n}{m}$ is a parameter of feature number $m$ and\nsample number $n$ and $T$ is the task number. 
Simulation results verify our\nalgorithm's close-to-optimum performance.\n","authors":["Shugang Hao","Lingjie Duan"],"pdf_url":"https://arxiv.org/pdf/2412.16830v2.pdf","comment":null}],"Multimedia":[{"id":"http://arxiv.org/abs/2412.18597v1","updated":"2024-12-24T18:51:19Z","published":"2024-12-24T18:51:19Z","title":"DiTCtrl: Exploring Attention Control in Multi-Modal Diffusion\n Transformer for Tuning-Free Multi-Prompt Longer Video Generation","summary":" Sora-like video generation models have achieved remarkable progress with a\nMulti-Modal Diffusion Transformer MM-DiT architecture. However, the current\nvideo generation models predominantly focus on single-prompt, struggling to\ngenerate coherent scenes with multiple sequential prompts that better reflect\nreal-world dynamic scenarios. While some pioneering works have explored\nmulti-prompt video generation, they face significant challenges including\nstrict training data requirements, weak prompt following, and unnatural\ntransitions. To address these problems, we propose DiTCtrl, a training-free\nmulti-prompt video generation method under MM-DiT architectures for the first\ntime. Our key idea is to take the multi-prompt video generation task as\ntemporal video editing with smooth transitions. To achieve this goal, we first\nanalyze MM-DiT's attention mechanism, finding that the 3D full attention\nbehaves similarly to that of the cross/self-attention blocks in the UNet-like\ndiffusion models, enabling mask-guided precise semantic control across\ndifferent prompts with attention sharing for multi-prompt video generation.\nBased on our careful design, the video generated by DiTCtrl achieves smooth\ntransitions and consistent object motion given multiple sequential prompts\nwithout additional training. Besides, we also present MPVBench, a new benchmark\nspecially designed for multi-prompt video generation to evaluate the\nperformance of multi-prompt generation. 
Extensive experiments demonstrate that\nour method achieves state-of-the-art performance without additional training.\n","authors":["Minghong Cai","Xiaodong Cun","Xiaoyu Li","Wenze Liu","Zhaoyang Zhang","Yong Zhang","Ying Shan","Xiangyu Yue"],"pdf_url":"https://arxiv.org/pdf/2412.18597v1.pdf","comment":"19 pages, 19 figures, Project page:\n https://onevfall.github.io/project_page/ditctrl ; GitHub repository:\n https://github.com/TencentARC/DiTCtrl"},{"id":"http://arxiv.org/abs/2412.18416v1","updated":"2024-12-24T13:08:34Z","published":"2024-12-24T13:08:34Z","title":"Muse: A Multimodal Conversational Recommendation Dataset with\n Scenario-Grounded User Profiles","summary":" Current conversational recommendation systems focus predominantly on text.\nHowever, real-world recommendation settings are generally multimodal, causing a\nsignificant gap between existing research and practical applications. To\naddress this issue, we propose Muse, the first multimodal conversational\nrecommendation dataset. Muse comprises 83,148 utterances from 7,000\nconversations centered around the Clothing domain. Each conversation contains\ncomprehensive multimodal interactions, rich elements, and natural dialogues.\nData in Muse are automatically synthesized by a multi-agent framework powered\nby multimodal large language models (MLLMs). It innovatively derives user\nprofiles from real-world scenarios rather than depending on manual design and\nhistory data for better scalability, and then it fulfills conversation\nsimulation and optimization. Both human and LLM evaluations demonstrate the\nhigh quality of conversations in Muse. 
Additionally, fine-tuning experiments on\nthree MLLMs demonstrate Muse's learnable patterns for recommendations and\nresponses, confirming its value for multimodal conversational recommendation.\nOur dataset and codes are available at\n\\url{https://anonymous.4open.science/r/Muse-0086}.\n","authors":["Zihan Wang","Xiaocui Yang","Yongkang Liu","Shi Feng","Daling Wang","Yifei Zhang"],"pdf_url":"https://arxiv.org/pdf/2412.18416v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.18390v1","updated":"2024-12-24T12:28:19Z","published":"2024-12-24T12:28:19Z","title":"RDPM: Solve Diffusion Probabilistic Models via Recurrent Token\n Prediction","summary":" Diffusion Probabilistic Models (DPMs) have emerged as the de facto approach\nfor high-fidelity image synthesis, operating diffusion processes on continuous\nVAE latent, which significantly differ from the text generation methods\nemployed by Large Language Models (LLMs). In this paper, we introduce a novel\ngenerative framework, the Recurrent Diffusion Probabilistic Model (RDPM), which\nenhances the diffusion process through a recurrent token prediction mechanism,\nthereby pioneering the field of Discrete Diffusion. By progressively\nintroducing Gaussian noise into the latent representations of images and\nencoding them into vector-quantized tokens in a recurrent manner, RDPM\nfacilitates a unique diffusion process on discrete-value domains. This process\niteratively predicts the token codes for subsequent timesteps, transforming the\ninitial standard Gaussian noise into the source data distribution, aligning\nwith GPT-style models in terms of the loss function. RDPM demonstrates superior\nperformance while benefiting from the speed advantage of requiring only a few\ninference steps. 
This model not only leverages the diffusion process to ensure\nhigh-quality generation but also converts continuous signals into a series of\nhigh-fidelity discrete tokens, thereby maintaining a unified optimization\nstrategy with other discrete tokens, such as text. We anticipate that this work\nwill contribute to the development of a unified model for multimodal\ngeneration, specifically by integrating continuous signal domains such as\nimages, videos, and audio with text. We will release the code and model weights\nto the open-source community.\n","authors":["Wu Xiaoping","Hu Jie","Wei Xiaoming"],"pdf_url":"https://arxiv.org/pdf/2412.18390v1.pdf","comment":"8 pages"},{"id":"http://arxiv.org/abs/2409.08772v2","updated":"2024-12-24T08:18:25Z","published":"2024-09-13T12:30:15Z","title":"The Practice of Averaging Rate-Distortion Curves over Testsets to\n Compare Learned Video Codecs Can Cause Misleading Conclusions","summary":" This paper aims to demonstrate how the prevalent practice in the learned\nvideo compression community of averaging rate-distortion (RD) curves across a\ntest video set can lead to misleading conclusions in evaluating codec\nperformance. Through analytical analysis of a simple case and experimental\nresults with two recent learned video codecs, we show how averaged RD curves\ncan mislead comparative evaluation of different codecs, particularly when\nvideos in a dataset have varying characteristics and operating ranges. We\nillustrate how a single video with distinct RD characteristics from the rest of\nthe test set can disproportionately influence the average RD curve, potentially\novershadowing a codec's superior performance across most individual sequences.\nUsing two recent learned video codecs on the UVG dataset as a case study, we\ndemonstrate computing performance metrics, such as the BD rate, from the\naverage RD curve suggests conclusions that contradict those reached from\ncalculating the average of per-sequence metrics. 
Hence, we argue that the\nlearned video compression community should also report per-sequence RD curves\nand performance metrics for a test set should be computed from the average of\nper-sequence metrics, similar to the established practice in traditional video\ncoding, to ensure fair and accurate codec comparisons.\n","authors":["M. Akin Yilmaz","Onur Keleş","A. Murat Tekalp"],"pdf_url":"https://arxiv.org/pdf/2409.08772v2.pdf","comment":"Submitted to IEEE Signal Processing Letters"},{"id":"http://arxiv.org/abs/2410.20898v2","updated":"2024-12-24T05:22:40Z","published":"2024-10-28T10:26:19Z","title":"Diff-Instruct*: Towards Human-Preferred One-step Text-to-image\n Generative Models","summary":" In this paper, we introduce the Diff-Instruct* (DI*), an image data-free\napproach for building one-step text-to-image generative models that align with\nhuman preference while maintaining the ability to generate highly realistic\nimages. We frame human preference alignment as online reinforcement learning\nusing human feedback (RLHF), where the goal is to maximize the reward function\nwhile regularizing the generator distribution to remain close to a reference\ndiffusion process. Unlike traditional RLHF approaches, which rely on the KL\ndivergence for regularization, we introduce a novel score-based divergence\nregularization, which leads to significantly better performances. Although the\ndirect calculation of this preference alignment objective remains intractable,\nwe demonstrate that we can efficiently compute its gradient by deriving an\nequivalent yet tractable loss function. Remarkably, we used Diff-Instruct* to\ntrain a Stable Diffusion-XL-based 1-step model, the 2.6B DI*-SDXL-1step\ntext-to-image model, which can generate images of a resolution of 1024x1024\nwith only 1 generation step. 
DI*-SDXL-1step model uses only 1.88% inference\ntime and 29.30% GPU memory cost to outperform 12B FLUX-dev-50step significantly\nin PickScore, ImageReward, and CLIPScore on Parti prompt benchmark and HPSv2.1\non Human Preference Score benchmark, establishing a new state-of-the-art\nbenchmark of human-preferred 1-step text-to-image generative models. Besides\nthe strong quantitative performances, extensive qualitative comparisons also\nconfirm the advantages of DI* in terms of maintaining diversity, improving\nimage layouts, and enhancing aesthetic colors. We have released our\nindustry-ready model on the homepage:\n\\url{https://github.com/pkulwj1994/diff_instruct_star}.\n","authors":["Weijian Luo","Colin Zhang","Debing Zhang","Zhengyang Geng"],"pdf_url":"https://arxiv.org/pdf/2410.20898v2.pdf","comment":"revision: 2.6B 1-step text-to-image model outperforms 12B\n Flux-dev-50step model in human preferences"},{"id":"http://arxiv.org/abs/2412.16642v2","updated":"2024-12-24T04:20:18Z","published":"2024-12-21T14:24:32Z","title":"L3TC: Leveraging RWKV for Learned Lossless Low-Complexity Text\n Compression","summary":" Learning-based probabilistic models can be combined with an entropy coder for\ndata compression. However, due to the high complexity of learning-based models,\ntheir practical application as text compressors has been largely overlooked. To\naddress this issue, our work focuses on a low-complexity design while\nmaintaining compression performance. We introduce a novel Learned Lossless\nLow-complexity Text Compression method (L3TC). Specifically, we conduct\nextensive experiments demonstrating that RWKV models achieve the fastest\ndecoding speed with a moderate compression ratio, making it the most suitable\nbackbone for our method. Second, we propose an outlier-aware tokenizer that\nuses a limited vocabulary to cover frequent tokens while allowing outliers to\nbypass the prediction and encoding. 
Third, we propose a novel high-rank\nreparameterization strategy that enhances the learning capability during\ntraining without increasing complexity during inference. Experimental results\nvalidate that our method achieves 48% bit saving compared to gzip compressor.\nBesides, L3TC offers compression performance comparable to other learned\ncompressors, with a 50x reduction in model parameters. More importantly, L3TC\nis the fastest among all learned compressors, providing real-time decoding\nspeeds up to megabytes per second. Our code is available at\nhttps://github.com/alipay/L3TC-leveraging-rwkv-for-learned-lossless-low-complexity-text-compression.git.\n","authors":["Junxuan Zhang","Zhengxue Cheng","Yan Zhao","Shihao Wang","Dajiang Zhou","Guo Lu","Li Song"],"pdf_url":"https://arxiv.org/pdf/2412.16642v2.pdf","comment":null}]},"2024-12-23T00:00:00Z":{"Computation and Language":[{"id":"http://arxiv.org/abs/2412.18051v1","updated":"2024-12-23T23:55:19Z","published":"2024-12-23T23:55:19Z","title":"Factuality or Fiction? Benchmarking Modern LLMs on Ambiguous QA with\n Citations","summary":" Benchmarking modern large language models (LLMs) on complex and realistic\ntasks is critical to advancing their development. In this work, we evaluate the\nfactual accuracy and citation performance of state-of-the-art LLMs on the task\nof Question Answering (QA) in ambiguous settings with source citations. Using\nthree recently published datasets-DisentQA-DupliCite, DisentQA-ParaCite, and\nAmbigQA-Cite-featuring a range of real-world ambiguities, we analyze the\nperformance of two leading LLMs, GPT-4o-mini and Claude-3.5. Our results show\nthat larger, recent models consistently predict at least one correct answer in\nambiguous contexts but fail to handle cases with multiple valid answers.\nAdditionally, all models perform equally poorly in citation generation, with\ncitation accuracy consistently at 0. 
However, introducing conflict-aware\nprompting leads to large improvements, enabling models to better address\nmultiple valid answers and improve citation accuracy, while maintaining their\nability to predict correct answers. These findings highlight the challenges and\nopportunities in developing LLMs that can handle ambiguity and provide reliable\nsource citations. Our benchmarking study provides critical insights and sets a\nfoundation for future improvements in trustworthy and interpretable QA systems.\n","authors":["Maya Patel","Aditi Anand"],"pdf_url":"https://arxiv.org/pdf/2412.18051v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.14368v2","updated":"2024-12-23T23:46:55Z","published":"2024-12-18T22:04:56Z","title":"Memorization Over Reasoning? Exposing and Mitigating Verbatim\n Memorization in Large Language Models' Character Understanding Evaluation","summary":" Recently, Large Language Models (LLMs) have shown impressive performance in\ncharacter understanding tasks, such as analyzing the roles, personalities, and\nrelationships of fictional characters. However, the extensive pre-training\ncorpora used by LLMs raise concerns that they may rely on memorizing popular\nfictional works rather than genuinely understanding and reasoning about them.\nIn this work, we argue that 'gist memory'-capturing essential meaning - should\nbe the primary mechanism for character understanding tasks, as opposed to\n'verbatim memory' - exact match of a string. We introduce a simple yet\neffective method to mitigate mechanized memorization in character understanding\nevaluations while preserving the essential implicit cues needed for\ncomprehension and reasoning. 
Our approach reduces memorization-driven\nperformance on popular fictional works from 96% accuracy to 72% and results in\nup to an 18% drop in accuracy across various character understanding tasks.\nThese findings underscore the issue of data contamination in existing\nbenchmarks, which often measure memorization rather than true character\nunderstanding.\n","authors":["Yuxuan Jiang","Francis Ferraro"],"pdf_url":"https://arxiv.org/pdf/2412.14368v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.18046v1","updated":"2024-12-23T23:44:13Z","published":"2024-12-23T23:44:13Z","title":"Emoji Retrieval from Gibberish or Garbled Social Media Text: A Novel\n Methodology and A Case Study","summary":" Emojis are widely used across social media platforms but are often lost in\nnoisy or garbled text, posing challenges for data analysis and machine\nlearning. Conventional preprocessing approaches recommend removing such text,\nrisking the loss of emojis and their contextual meaning. This paper proposes a\nthree-step reverse-engineering methodology to retrieve emojis from garbled text\nin social media posts. The methodology also identifies reasons for the\ngeneration of such text during social media data mining. To evaluate its\neffectiveness, the approach was applied to 509,248 Tweets about the Mpox\noutbreak, a dataset referenced in about 30 prior works that failed to retrieve\nemojis from garbled text. Our method retrieved 157,748 emojis from 76,914\nTweets. Improvements in text readability and coherence were demonstrated\nthrough metrics such as Flesch Reading Ease, Flesch-Kincaid Grade Level,\nColeman-Liau Index, Automated Readability Index, Dale-Chall Readability Score,\nText Standard, and Reading Time. 
Additionally, the frequency of individual\nemojis and their patterns of usage in these Tweets were analyzed, and the\nresults are presented.\n","authors":["Shuqi Cui","Nirmalya Thakur","Audrey Poon"],"pdf_url":"https://arxiv.org/pdf/2412.18046v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.18043v1","updated":"2024-12-23T23:39:05Z","published":"2024-12-23T23:39:05Z","title":"Aligning AI Research with the Needs of Clinical Coding Workflows: Eight\n Recommendations Based on US Data Analysis and Critical Review","summary":" Clinical coding is crucial for healthcare billing and data analysis. Manual\nclinical coding is labour-intensive and error-prone, which has motivated\nresearch towards full automation of the process. However, our analysis, based\non US English electronic health records and automated coding research using\nthese records, shows that widely used evaluation methods are not aligned with\nreal clinical contexts. For example, evaluations that focus on the top 50 most\ncommon codes are an oversimplification, as there are thousands of codes used in\npractice. This position paper aims to align AI coding research more closely\nwith practical challenges of clinical coding. Based on our analysis, we offer\neight specific recommendations, suggesting ways to improve current evaluation\nmethods. Additionally, we propose new AI-based methods beyond automated coding,\nsuggesting alternative approaches to assist clinical coders in their workflows.\n","authors":["Yidong Gan","Maciej Rybinski","Ben Hachey","Jonathan K. 
Kummerfeld"],"pdf_url":"https://arxiv.org/pdf/2412.18043v1.pdf","comment":"We received a meta-review score of 5 in ARR October 2024"},{"id":"http://arxiv.org/abs/2412.18040v1","updated":"2024-12-23T23:26:07Z","published":"2024-12-23T23:26:07Z","title":"Theoretical Constraints on the Expressive Power of $\\mathsf{RoPE}$-based\n Tensor Attention Transformers","summary":" Tensor Attention extends traditional attention mechanisms by capturing\nhigh-order correlations across multiple modalities, addressing the limitations\nof classical matrix-based attention. Meanwhile, Rotary Position Embedding\n($\\mathsf{RoPE}$) has shown superior performance in encoding positional\ninformation in long-context scenarios, significantly enhancing transformer\nmodels' expressiveness. Despite these empirical successes, the theoretical\nlimitations of these technologies remain underexplored. In this study, we\nanalyze the circuit complexity of Tensor Attention and $\\mathsf{RoPE}$-based\nTensor Attention, showing that with polynomial precision, constant-depth\nlayers, and linear or sublinear hidden dimension, they cannot solve fixed\nmembership problems or $(A_{F,r})^*$ closure problems, under the assumption\nthat $\\mathsf{TC}^0 \\neq \\mathsf{NC}^1$. 
These findings highlight a gap between\nthe empirical performance and theoretical constraints of Tensor Attention and\n$\\mathsf{RoPE}$-based Tensor Attention Transformers, offering insights that\ncould guide the development of more theoretically grounded approaches to\nTransformer model design and scaling.\n","authors":["Xiaoyu Li","Yingyu Liang","Zhenmei Shi","Zhao Song","Mingda Wan"],"pdf_url":"https://arxiv.org/pdf/2412.18040v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.18036v1","updated":"2024-12-23T23:09:56Z","published":"2024-12-23T23:09:56Z","title":"Explainability in Neural Networks for Natural Language Processing Tasks","summary":" Neural networks are widely regarded as black-box models, creating significant\nchallenges in understanding their inner workings, especially in natural\nlanguage processing (NLP) applications. To address this opacity, model\nexplanation techniques like Local Interpretable Model-Agnostic Explanations\n(LIME) have emerged as essential tools for providing insights into the behavior\nof these complex systems. This study leverages LIME to interpret a multi-layer\nperceptron (MLP) neural network trained on a text classification task. By\nanalyzing the contribution of individual features to model predictions, the\nLIME approach enhances interpretability and supports informed decision-making.\nDespite its effectiveness in offering localized explanations, LIME has\nlimitations in capturing global patterns and feature interactions. 
This\nresearch highlights the strengths and shortcomings of LIME and proposes\ndirections for future work to achieve more comprehensive interpretability in\nneural NLP models.\n","authors":["Melkamu Mersha","Mingiziem Bitewa","Tsion Abay","Jugal Kalita"],"pdf_url":"https://arxiv.org/pdf/2412.18036v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.18029v1","updated":"2024-12-23T22:49:38Z","published":"2024-12-23T22:49:38Z","title":"Same Company, Same Signal: The Role of Identity in Earnings Call\n Transcripts","summary":" Post-earnings volatility prediction is critical for investors, with previous\nworks often leveraging earnings call transcripts under the assumption that\ntheir rich semantics contribute significantly. To further investigate how\ntranscripts impact volatility, we introduce DEC, a dataset featuring accurate\nvolatility calculations enabled by the previously overlooked beforeAfterMarket\nattribute and dense ticker coverage. Unlike established benchmarks, where each\nticker has only around two earnings, DEC provides 20 earnings records per\nticker. Using DEC, we reveal that post-earnings volatility undergoes\nsignificant shifts, with each ticker displaying a distinct volatility\ndistribution. To leverage historical post-earnings volatility and capture\nticker-specific patterns, we propose two training-free baselines: Post-earnings\nVolatility (PEV) and Same-ticker Post-earnings Volatility (STPEV). These\nbaselines surpass all transcripts-based models on DEC as well as on established\nbenchmarks. Additionally, we demonstrate that current transcript\nrepresentations predominantly capture ticker identity rather than offering\nfinancially meaningful insights specific to each earnings. 
This is evidenced by\ntwo key observations: earnings representations from the same ticker exhibit\nsignificantly higher similarity compared to those from different tickers, and\npredictions from transcript-based models show strong correlations with prior\npost-earnings volatility.\n","authors":["Ding Yu","Zhuo Liu","Hangfeng He"],"pdf_url":"https://arxiv.org/pdf/2412.18029v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.18011v1","updated":"2024-12-23T22:08:40Z","published":"2024-12-23T22:08:40Z","title":"StructTest: Benchmarking LLMs' Reasoning through Compositional\n Structured Outputs","summary":" The rapid development of large language models (LLMs) necessitates robust,\nunbiased, and scalable methods for evaluating their capabilities. However,\nhuman annotations are expensive to scale, model-based evaluations are prone to\nbiases in answer style, while target-answer-based benchmarks are vulnerable to\ndata contamination and cheating. To address these limitations, we propose\nStructTest, a novel benchmark that evaluates LLMs on their ability to produce\ncompositionally specified structured outputs as an unbiased, cheap-to-run and\ndifficult-to-cheat measure. The evaluation is done deterministically by a\nrule-based evaluator, which can be easily extended to new tasks. By testing\nstructured outputs across diverse task domains -- including Summarization,\nCode, HTML and Math -- we demonstrate that StructTest serves as a good proxy\nfor general reasoning abilities, as producing structured outputs often requires\ninternal logical reasoning. 
We believe that StructTest offers a critical,\ncomplementary approach to objective and robust model evaluation.\n","authors":["Hailin Chen","Fangkai Jiao","Mathieu Ravaut","Nawshad Farruque","Xuan Phi Nguyen","Chengwei Qin","Manan Dey","Bosheng Ding","Caiming Xiong","Shafiq Joty","Yingbo Zhou"],"pdf_url":"https://arxiv.org/pdf/2412.18011v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.18004v1","updated":"2024-12-23T21:57:11Z","published":"2024-12-23T21:57:11Z","title":"Correctness is not Faithfulness in RAG Attributions","summary":" Retrieving relevant context is a common approach to reduce hallucinations and\nenhance answer reliability. Explicitly citing source documents allows users to\nverify generated responses and increases trust. Prior work largely evaluates\ncitation correctness - whether cited documents support the corresponding\nstatements. But citation correctness alone is insufficient. To establish trust\nin attributed answers, we must examine both citation correctness and citation\nfaithfulness. In this work, we first disentangle the notions of citation\ncorrectness and faithfulness, which have been applied inconsistently in\nprevious studies. Faithfulness ensures that the model's reliance on cited\ndocuments is genuine, reflecting actual reference use rather than superficial\nalignment with prior beliefs, which we call post-rationalization. We design an\nexperiment that reveals the prevalent issue of post-rationalization, which\nundermines reliable attribution and may result in misplaced trust. 
Our findings\nsuggest that current attributed answers often lack citation faithfulness (up to\n57 percent of the citations), highlighting the need to evaluate correctness and\nfaithfulness for trustworthy attribution in language models.\n","authors":["Jonas Wallat","Maria Heuss","Maarten de Rijke","Avishek Anand"],"pdf_url":"https://arxiv.org/pdf/2412.18004v1.pdf","comment":"13 pages, 3 figures"},{"id":"http://arxiv.org/abs/2412.05453v2","updated":"2024-12-23T20:40:52Z","published":"2024-12-06T22:25:23Z","title":"Knowledge Graphs are all you need: Leveraging KGs in Physics Question\n Answering","summary":" This study explores the effectiveness of using knowledge graphs generated by\nlarge language models to decompose high school-level physics questions into\nsub-questions. We introduce a pipeline aimed at enhancing model response\nquality for Question Answering tasks. By employing LLMs to construct knowledge\ngraphs that capture the internal logic of the questions, these graphs then\nguide the generation of subquestions. We hypothesize that this method yields\nsub-questions that are more logically consistent with the original questions\ncompared to traditional decomposition techniques. Our results show that\nsub-questions derived from knowledge graphs exhibit significantly improved\nfidelity to the original question's logic. This approach not only enhances the\nlearning experience by providing clearer and more contextually appropriate\nsub-questions but also highlights the potential of LLMs to transform\neducational methodologies. 
The findings indicate a promising direction for\napplying AI to improve the quality and effectiveness of educational content.\n","authors":["Krishnasai Addala","Kabir Dev Paul Baghel","Dhruv Jain","Chhavi Kirtani","Avinash Anand","Rajiv Ratn Shah"],"pdf_url":"https://arxiv.org/pdf/2412.05453v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.17970v1","updated":"2024-12-23T20:34:32Z","published":"2024-12-23T20:34:32Z","title":"CARL-GT: Evaluating Causal Reasoning Capabilities of Large Language\n Models","summary":" Causal reasoning capabilities are essential for large language models (LLMs)\nin a wide range of applications, such as education and healthcare. But there is\nstill a lack of benchmarks for a better understanding of such capabilities.\nCurrent LLM benchmarks are mainly based on conversational tasks, academic math\ntests, and coding tests. Such benchmarks evaluate LLMs in well-regularized\nsettings, but they are limited in assessing the skills and abilities to solve\nreal-world problems. In this work, we provide a benchmark, named by CARL-GT,\nwhich evaluates CAusal Reasoning capabilities of large Language models using\nGraphs and Tabular data. The benchmark has a diverse range of tasks for\nevaluating LLMs from causal graph reasoning, knowledge discovery, and\ndecision-making aspects. In addition, effective zero-shot learning prompts are\ndeveloped for the tasks. In our experiments, we leverage the benchmark for\nevaluating open-source LLMs and provide a detailed comparison of LLMs for\ncausal reasoning abilities. We found that LLMs are still weak in causal\nreasoning, especially with tabular data to discover new insights. Furthermore,\nwe investigate and discuss the relationships of different benchmark tasks by\nanalyzing the performance of LLMs. 
The experimental results show that LLMs have\ndifferent strength over different tasks and that their performance on tasks in\ndifferent categories, i.e., causal graph reasoning, knowledge discovery, and\ndecision-making, shows stronger correlation than tasks in the same category.\n","authors":["Ruibo Tu","Hedvig Kjellström","Gustav Eje Henter","Cheng Zhang"],"pdf_url":"https://arxiv.org/pdf/2412.17970v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.17963v1","updated":"2024-12-23T20:27:12Z","published":"2024-12-23T20:27:12Z","title":"Path-of-Thoughts: Extracting and Following Paths for Robust Relational\n Reasoning with Large Language Models","summary":" Large language models (LLMs) possess vast semantic knowledge but often\nstruggle with complex reasoning tasks, particularly in relational reasoning\nproblems such as kinship or spatial reasoning. In this paper, we present\nPath-of-Thoughts (PoT), a novel framework designed to tackle relation reasoning\nby decomposing the task into three key stages: graph extraction, path\nidentification, and reasoning. Unlike previous approaches, PoT efficiently\nextracts a task-agnostic graph that identifies crucial entities, relations, and\nattributes within the problem context. Subsequently, PoT identifies relevant\nreasoning chains within the graph corresponding to the posed question,\nfacilitating inference of potential answers. Experimental evaluations on four\nbenchmark datasets, demanding long reasoning chains, demonstrate that PoT\nsurpasses state-of-the-art baselines by a significant margin (maximum 21.3%)\nwithout necessitating fine-tuning or extensive LLM calls. 
Furthermore, as\nopposed to prior neuro-symbolic methods, PoT exhibits improved resilience\nagainst LLM errors by leveraging the compositional nature of graphs.\n","authors":["Ge Zhang","Mohammad Ali Alomrani","Hongjian Gu","Jiaming Zhou","Yaochen Hu","Bin Wang","Qun Liu","Mark Coates","Yingxue Zhang","Jianye Hao"],"pdf_url":"https://arxiv.org/pdf/2412.17963v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.13178v2","updated":"2024-12-23T20:12:48Z","published":"2024-01-24T01:51:00Z","title":"AgentBoard: An Analytical Evaluation Board of Multi-turn LLM Agents","summary":" Evaluating Large Language Models (LLMs) as general-purpose agents is\nessential for understanding their capabilities and facilitating their\nintegration into practical applications. However, the evaluation process\npresents substantial challenges. A primary obstacle is the benchmarking of\nagent performance across diverse scenarios within a unified framework,\nespecially in maintaining partially-observable environments and ensuring\nmulti-round interactions. Moreover, current evaluation frameworks mostly focus\non the final success rate, revealing few insights during the process and\nfailing to provide a deep understanding of the model abilities. To address\nthese challenges, we introduce AgentBoard, a pioneering comprehensive benchmark\nand accompanied open-source evaluation framework tailored to analytical\nevaluation of LLM agents. 
AgentBoard offers a fine-grained progress rate metric\nthat captures incremental advancements as well as a comprehensive evaluation\ntoolkit that features easy assessment of agents for multi-faceted analysis.\nThis not only sheds light on the capabilities and limitations of LLM agents but\nalso propels the interpretability of their performance to the forefront.\nUltimately, AgentBoard serves as a step towards demystifying agent behaviors\nand accelerating the development of stronger LLM agents.\n","authors":["Chang Ma","Junlei Zhang","Zhihao Zhu","Cheng Yang","Yujiu Yang","Yaohui Jin","Zhenzhong Lan","Lingpeng Kong","Junxian He"],"pdf_url":"https://arxiv.org/pdf/2401.13178v2.pdf","comment":"NeurIPS 2024 (Oral)"},{"id":"http://arxiv.org/abs/2412.05023v2","updated":"2024-12-23T20:06:24Z","published":"2024-12-06T13:20:57Z","title":"Steps are all you need: Rethinking STEM Education with Prompt\n Engineering","summary":" Few shot and Chain-of-Thought prompting have shown promise when applied to\nPhysics Question Answering Tasks, but are limited by the lack of mathematical\nability inherent to LLMs, and are prone to hallucination. By utilizing a\nMixture of Experts (MoE) Model, along with analogical prompting, we are able to\nshow improved model performance when compared to the baseline on standard LLMs.\nWe also survey the limits of these prompting techniques and the effects they\nhave on model performance. 
Additionally, we propose Analogical CoT prompting, a\nprompting technique designed to allow smaller, open source models to leverage\nAnalogical prompting, something they have struggled with, possibly due to a\nlack of specialist training data.\n","authors":["Krishnasai Addala","Kabir Dev Paul Baghel","Chhavi Kirtani","Avinash Anand","Rajiv Ratn Shah"],"pdf_url":"https://arxiv.org/pdf/2412.05023v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.17947v1","updated":"2024-12-23T19:58:11Z","published":"2024-12-23T19:58:11Z","title":"IITR-CIOL@NLU of Devanagari Script Languages 2025: Multilingual Hate\n Speech Detection and Target Identification in Devanagari-Scripted Languages","summary":" This work focuses on two subtasks related to hate speech detection and target\nidentification in Devanagari-scripted languages, specifically Hindi, Marathi,\nNepali, Bhojpuri, and Sanskrit. Subtask B involves detecting hate speech in\nonline text, while Subtask C requires identifying the specific targets of hate\nspeech, such as individuals, organizations, or communities. We propose the\nMultilingualRobertaClass model, a deep neural network built on the pretrained\nmultilingual transformer model ia-multilingual-transliterated-roberta,\noptimized for classification tasks in multilingual and transliterated contexts.\nThe model leverages contextualized embeddings to handle linguistic diversity,\nwith a classifier head for binary classification. 
We received 88.40% accuracy\nin Subtask B and 66.11% accuracy in Subtask C, in the test set.\n","authors":["Siddhant Gupta","Siddh Singhal","Azmine Toushik Wasi"],"pdf_url":"https://arxiv.org/pdf/2412.17947v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2410.12880v2","updated":"2024-12-23T19:53:33Z","published":"2024-10-15T18:13:10Z","title":"Navigating the Cultural Kaleidoscope: A Hitchhiker's Guide to\n Sensitivity in Large Language Models","summary":" As LLMs are increasingly deployed in global applications, the importance of\ncultural sensitivity becomes paramount, ensuring that users from diverse\nbackgrounds feel respected and understood. Cultural harm can arise when these\nmodels fail to align with specific cultural norms, resulting in\nmisrepresentations or violations of cultural values. This work addresses the\nchallenges of ensuring cultural sensitivity in LLMs, especially in\nsmall-parameter models that often lack the extensive training data needed to\ncapture global cultural nuances. We present two key contributions: (1) A\ncultural harm test dataset, created to assess model outputs across different\ncultural contexts through scenarios that expose potential cultural\ninsensitivities, and (2) A culturally aligned preference dataset, aimed at\nrestoring cultural sensitivity through fine-tuning based on feedback from\ndiverse annotators. These datasets facilitate the evaluation and enhancement of\nLLMs, ensuring their ethical and safe deployment across different cultural\nlandscapes. Our results show that integrating culturally aligned feedback leads\nto a marked improvement in model behavior, significantly reducing the\nlikelihood of generating culturally insensitive or harmful content. 
Ultimately,\nthis work paves the way for more inclusive and respectful AI systems, fostering\na future where LLMs can safely and ethically navigate the complexities of\ndiverse cultural landscapes.\n","authors":["Somnath Banerjee","Sayan Layek","Hari Shrawgi","Rajarshi Mandal","Avik Halder","Shanu Kumar","Sagnik Basu","Parag Agrawal","Rima Hazra","Animesh Mukherjee"],"pdf_url":"https://arxiv.org/pdf/2410.12880v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.17933v1","updated":"2024-12-23T19:45:20Z","published":"2024-12-23T19:45:20Z","title":"BenCzechMark : A Czech-centric Multitask and Multimetric Benchmark for\n Large Language Models with Duel Scoring Mechanism","summary":" We present BenCzechMark (BCM), the first comprehensive Czech language\nbenchmark designed for large language models, offering diverse tasks, multiple\ntask formats, and multiple evaluation metrics. Its scoring system is grounded\nin statistical significance theory and uses aggregation across tasks inspired\nby social preference theory. Our benchmark encompasses 50 challenging tasks,\nwith corresponding test datasets, primarily in native Czech, with 11 newly\ncollected ones. These tasks span 8 categories and cover diverse domains,\nincluding historical Czech news, essays from pupils or language learners, and\nspoken word.\n Furthermore, we collect and clean BUT-Large Czech Collection, the largest\npublicly available clean Czech language corpus, and use it for (i)\ncontamination analysis, (ii) continuous pretraining of the first Czech-centric\n7B language model, with Czech-specific tokenization. We use our model as a\nbaseline for comparison with publicly available multilingual models. 
Lastly, we\nrelease and maintain a leaderboard, with existing 44 model submissions, where\nnew model submissions can be made at\nhttps://huggingface.co/spaces/CZLC/BenCzechMark.\n","authors":["Martin Fajcik","Martin Docekal","Jan Dolezal","Karel Ondrej","Karel Beneš","Jan Kapsa","Pavel Smrz","Alexander Polok","Michal Hradis","Zuzana Neverilova","Ales Horak","Radoslav Sabol","Michal Stefanik","Adam Jirkovsky","David Adamczyk","Petr Hyner","Jan Hula","Hynek Kydlicek"],"pdf_url":"https://arxiv.org/pdf/2412.17933v1.pdf","comment":"first version"},{"id":"http://arxiv.org/abs/2412.17921v1","updated":"2024-12-23T19:24:51Z","published":"2024-12-23T19:24:51Z","title":"VITRO: Vocabulary Inversion for Time-series Representation Optimization","summary":" Although LLMs have demonstrated remarkable capabilities in processing and\ngenerating textual data, their pre-trained vocabularies are ill-suited for\ncapturing the nuanced temporal dynamics and patterns inherent in time series.\nThe discrete, symbolic nature of natural language tokens, which these\nvocabularies are designed to represent, does not align well with the\ncontinuous, numerical nature of time series data. To address this fundamental\nlimitation, we propose VITRO. Our method adapts textual inversion optimization\nfrom the vision-language domain in order to learn a new time series per-dataset\nvocabulary that bridges the gap between the discrete, semantic nature of\nnatural language and the continuous, numerical nature of time series data. We\nshow that learnable time series-specific pseudo-word embeddings represent time\nseries data better than existing general language model vocabularies, with\nVITRO-enhanced methods achieving state-of-the-art performance in long-term\nforecasting across most datasets.\n","authors":["Filippos Bellos","Nam H. Nguyen","Jason J. 
Corso"],"pdf_url":"https://arxiv.org/pdf/2412.17921v1.pdf","comment":"Accepted to ICASSP 2025"},{"id":"http://arxiv.org/abs/2407.04108v3","updated":"2024-12-23T19:24:44Z","published":"2024-07-04T18:24:09Z","title":"Future Events as Backdoor Triggers: Investigating Temporal\n Vulnerabilities in LLMs","summary":" Backdoors are hidden behaviors that are only triggered once an AI system has\nbeen deployed. Bad actors looking to create successful backdoors must design\nthem to avoid activation during training and evaluation. Since data used in\nthese stages often only contains information about events that have already\noccurred, a component of a simple backdoor trigger could be a model recognizing\ndata that is in the future relative to when it was trained. Through prompting\nexperiments and by probing internal activations, we show that current large\nlanguage models (LLMs) can distinguish past from future events, with probes on\nmodel activations achieving 90% accuracy. We train models with backdoors\ntriggered by a temporal distributional shift; they activate when the model is\nexposed to news headlines beyond their training cut-off dates. Fine-tuning on\nhelpful, harmless and honest (HHH) data does not work well for removing simpler\nbackdoor triggers but is effective on our backdoored models, although this\ndistinction is smaller for the larger-scale model we tested. We also find that\nan activation-steering vector representing a model's internal representation of\nthe date influences the rate of backdoor activation. 
We take these results as\ninitial evidence that, at least for models at the modest scale we test,\nstandard safety measures are enough to remove these backdoors.\n","authors":["Sara Price","Arjun Panickssery","Sam Bowman","Asa Cooper Stickland"],"pdf_url":"https://arxiv.org/pdf/2407.04108v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.14962v2","updated":"2024-12-23T19:01:23Z","published":"2024-11-22T14:21:18Z","title":"LLM for Barcodes: Generating Diverse Synthetic Data for Identity\n Documents","summary":" Accurate barcode detection and decoding in Identity documents is crucial for\napplications like security, healthcare, and education, where reliable data\nextraction and verification are essential. However, building robust detection\nmodels is challenging due to the lack of diverse, realistic datasets an issue\noften tied to privacy concerns and the wide variety of document formats.\nTraditional tools like Faker rely on predefined templates, making them less\neffective for capturing the complexity of real-world identity documents. In\nthis paper, we introduce a new approach to synthetic data generation that uses\nLLMs to create contextually rich and realistic data without relying on\npredefined field. Using the vast knowledge LLMs have about different documents\nand content, our method creates data that reflects the variety found in real\nidentity documents. This data is then encoded into barcode and overlayed on\ntemplates for documents such as Driver's licenses, Insurance cards, Student\nIDs. Our approach simplifies the process of dataset creation, eliminating the\nneed for extensive domain knowledge or predefined fields. Compared to\ntraditional methods like Faker, data generated by LLM demonstrates greater\ndiversity and contextual relevance, leading to improved performance in barcode\ndetection models. 
This scalable, privacy-first solution is a big step forward\nin advancing machine learning for automated document processing and identity\nverification.\n","authors":["Hitesh Laxmichand Patel","Amit Agarwal","Bhargava Kumar","Karan Gupta","Priyaranjan Pattnayak"],"pdf_url":"https://arxiv.org/pdf/2411.14962v2.pdf","comment":"5 pages, 1 figures"},{"id":"http://arxiv.org/abs/2412.17907v1","updated":"2024-12-23T19:00:34Z","published":"2024-12-23T19:00:34Z","title":"A Multimodal Emotion Recognition System: Integrating Facial Expressions,\n Body Movement, Speech, and Spoken Language","summary":" Traditional psychological evaluations rely heavily on human observation and\ninterpretation, which are prone to subjectivity, bias, fatigue, and\ninconsistency. To address these limitations, this work presents a multimodal\nemotion recognition system that provides a standardised, objective, and\ndata-driven tool to support evaluators, such as psychologists, psychiatrists,\nand clinicians. The system integrates recognition of facial expressions,\nspeech, spoken language, and body movement analysis to capture subtle emotional\ncues that are often overlooked in human evaluations. By combining these\nmodalities, the system provides more robust and comprehensive emotional state\nassessment, reducing the risk of mis- and overdiagnosis. Preliminary testing in\na simulated real-world condition demonstrates the system's potential to provide\nreliable emotional insights to improve the diagnostic accuracy. 
This work\nhighlights the promise of automated multimodal analysis as a valuable\ncomplement to traditional psychological evaluation practices, with applications\nin clinical and therapeutic settings.\n","authors":["Kris Kraack"],"pdf_url":"https://arxiv.org/pdf/2412.17907v1.pdf","comment":"10 pages, 6 figures, 3 tables"},{"id":"http://arxiv.org/abs/2412.17787v1","updated":"2024-12-23T18:48:04Z","published":"2024-12-23T18:48:04Z","title":"Cross-Lingual Text-Rich Visual Comprehension: An Information Theory\n Perspective","summary":" Recent Large Vision-Language Models (LVLMs) have shown promising reasoning\ncapabilities on text-rich images from charts, tables, and documents. However,\nthe abundant text within such images may increase the model's sensitivity to\nlanguage. This raises the need to evaluate LVLM performance on cross-lingual\ntext-rich visual inputs, where the language in the image differs from the\nlanguage of the instructions. To address this, we introduce XT-VQA\n(Cross-Lingual Text-Rich Visual Question Answering), a benchmark designed to\nassess how LVLMs handle language inconsistency between image text and\nquestions. XT-VQA integrates five existing text-rich VQA datasets and a newly\ncollected dataset, XPaperQA, covering diverse scenarios that require faithful\nrecognition and comprehension of visual information despite language\ninconsistency. Our evaluation of prominent LVLMs on XT-VQA reveals a\nsignificant drop in performance for cross-lingual scenarios, even for models\nwith multilingual capabilities. A mutual information analysis suggests that\nthis performance gap stems from cross-lingual questions failing to adequately\nactivate relevant visual information. To mitigate this issue, we propose\nMVCL-MI (Maximization of Vision-Language Cross-Lingual Mutual Information),\nwhere a visual-text cross-lingual alignment is built by maximizing mutual\ninformation between the model's outputs and visual information. 
This is\nachieved by distilling knowledge from monolingual to cross-lingual settings\nthrough KL divergence minimization, where monolingual output logits serve as a\nteacher. Experimental results on the XT-VQA demonstrate that MVCL-MI\neffectively reduces the visual-text cross-lingual performance disparity while\npreserving the inherent capabilities of LVLMs, shedding new light on the\npotential practice for improving LVLMs. Codes are available at:\nhttps://github.com/Stardust-y/XTVQA.git\n","authors":["Xinmiao Yu","Xiaocheng Feng","Yun Li","Minghui Liao","Ya-Qi Yu","Xiachong Feng","Weihong Zhong","Ruihan Chen","Mengkang Hu","Jihao Wu","Dandan Tu","Duyu Tang","Bing Qin"],"pdf_url":"https://arxiv.org/pdf/2412.17787v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2406.06608v4","updated":"2024-12-23T18:38:36Z","published":"2024-06-06T18:10:11Z","title":"The Prompt Report: A Systematic Survey of Prompting Techniques","summary":" Generative Artificial Intelligence (GenAI) systems are increasingly being\ndeployed across diverse industries and research domains. Developers and\nend-users interact with these systems through the use of prompting and prompt\nengineering. Although prompt engineering is a widely adopted and extensively\nresearched area, it suffers from conflicting terminology and a fragmented\nontological understanding of what constitutes an effective prompt due to its\nrelatively recent emergence. We establish a structured understanding of prompt\nengineering by assembling a taxonomy of prompting techniques and analyzing\ntheir applications. We present a detailed vocabulary of 33 vocabulary terms, a\ntaxonomy of 58 LLM prompting techniques, and 40 techniques for other\nmodalities. Additionally, we provide best practices and guidelines for prompt\nengineering, including advice for prompting state-of-the-art (SOTA) LLMs such\nas ChatGPT. We further present a meta-analysis of the entire literature on\nnatural language prefix-prompting. 
As a culmination of these efforts, this\npaper presents the most comprehensive survey on prompt engineering to date.\n","authors":["Sander Schulhoff","Michael Ilie","Nishant Balepur","Konstantine Kahadze","Amanda Liu","Chenglei Si","Yinheng Li","Aayush Gupta","HyoJung Han","Sevien Schulhoff","Pranav Sandeep Dulepet","Saurav Vidyadhara","Dayeon Ki","Sweta Agrawal","Chau Pham","Gerson Kroiz","Feileen Li","Hudson Tao","Ashay Srivastava","Hevander Da Costa","Saloni Gupta","Megan L. Rogers","Inna Goncearenco","Giuseppe Sarli","Igor Galynker","Denis Peskoff","Marine Carpuat","Jules White","Shyamal Anadkat","Alexander Hoyle","Philip Resnik"],"pdf_url":"https://arxiv.org/pdf/2406.06608v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.17767v1","updated":"2024-12-23T18:26:53Z","published":"2024-12-23T18:26:53Z","title":"ResearchTown: Simulator of Human Research Community","summary":" Large Language Models (LLMs) have demonstrated remarkable potential in\nscientific domains, yet a fundamental question remains unanswered: Can we\nsimulate human research communities with LLMs? Addressing this question can\ndeepen our understanding of the processes behind idea brainstorming and inspire\nthe automatic discovery of novel scientific insights. In this work, we propose\nResearchTown, a multi-agent framework for research community simulation. Within\nthis framework, the human research community is simplified and modeled as an\nagent-data graph, where researchers and papers are represented as agent-type\nand data-type nodes, respectively, and connected based on their collaboration\nrelationships. We also introduce TextGNN, a text-based inference framework that\nmodels various research activities (e.g., paper reading, paper writing, and\nreview writing) as special forms of a unified message-passing process on the\nagent-data graph. 
To evaluate the quality of the research simulation, we\npresent ResearchBench, a benchmark that uses a node-masking prediction task for\nscalable and objective assessment based on similarity. Our experiments reveal\nthree key findings: (1) ResearchTown can provide a realistic simulation of\ncollaborative research activities, including paper writing and review writing;\n(2) ResearchTown can maintain robust simulation with multiple researchers and\ndiverse papers; (3) ResearchTown can generate interdisciplinary research ideas\nthat potentially inspire novel research directions.\n","authors":["Haofei Yu","Zhaochen Hong","Zirui Cheng","Kunlun Zhu","Keyang Xuan","Jinwei Yao","Tao Feng","Jiaxuan You"],"pdf_url":"https://arxiv.org/pdf/2412.17767v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.17758v1","updated":"2024-12-23T18:14:36Z","published":"2024-12-23T18:14:36Z","title":"In Case You Missed It: ARC 'Challenge' Is Not That Challenging","summary":" ARC Challenge appears more difficult than ARC Easy for modern LLMs primarily\ndue to an evaluation setup that prevents direct comparison of answer choices\nrather than inherent complexity. Although some researchers have quietly shifted\nto a more appropriate scheme over the last year, the implications of this\nchange have yet to be widely acknowledged. We highlight this overlooked shift,\nshow how similar evaluation practices falsely imply reasoning deficits in other\nbenchmarks, and demonstrate that fairer methods dramatically reduce performance\ngaps (e.g. on SIQA) and even yield superhuman results (OpenBookQA). 
In doing\nso, we reveal how evaluation shapes perceived difficulty and offer guidelines\nto ensure that multiple-choice evaluations accurately reflect actual model\ncapabilities.\n","authors":["Łukasz Borchmann"],"pdf_url":"https://arxiv.org/pdf/2412.17758v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.04637v2","updated":"2024-12-23T18:09:34Z","published":"2024-11-07T11:51:14Z","title":"Hands-On Tutorial: Labeling with LLM and Human-in-the-Loop","summary":" Training and deploying machine learning models relies on a large amount of\nhuman-annotated data. As human labeling becomes increasingly expensive and\ntime-consuming, recent research has developed multiple strategies to speed up\nannotation and reduce costs and human workload: generating synthetic training\ndata, active learning, and hybrid labeling. This tutorial is oriented toward\npractical applications: we will present the basics of each strategy, highlight\ntheir benefits and limitations, and discuss in detail real-life case studies.\nAdditionally, we will walk through best practices for managing human annotators\nand controlling the quality of the final dataset. The tutorial includes a\nhands-on workshop, where attendees will be guided in implementing a hybrid\nannotation setup. 
This tutorial is designed for NLP practitioners from both\nresearch and industry backgrounds who are involved in or interested in\noptimizing data labeling projects.\n","authors":["Ekaterina Artemova","Akim Tsvigun","Dominik Schlechtweg","Natalia Fedorova","Sergei Tilga","Konstantin Chernyshev","Boris Obmoroshev"],"pdf_url":"https://arxiv.org/pdf/2411.04637v2.pdf","comment":"To be presented at COLING 2025"},{"id":"http://arxiv.org/abs/2412.17747v1","updated":"2024-12-23T18:02:25Z","published":"2024-12-23T18:02:25Z","title":"Deliberation in Latent Space via Differentiable Cache Augmentation","summary":" Techniques enabling large language models (LLMs) to \"think more\" by\ngenerating and attending to intermediate reasoning steps have shown promise in\nsolving complex problems. However, the standard approaches generate sequences\nof discrete tokens immediately before responding, and so they can incur\nsignificant latency costs and be challenging to optimize. In this work, we\ndemonstrate that a frozen LLM can be augmented with an offline coprocessor that\noperates on the model's key-value (kv) cache. This coprocessor augments the\ncache with a set of latent embeddings designed to improve the fidelity of\nsubsequent decoding. We train this coprocessor using the language modeling loss\nfrom the decoder on standard pretraining data, while keeping the decoder itself\nfrozen. This approach enables the model to learn, in an end-to-end\ndifferentiable fashion, how to distill additional computation into its\nkv-cache. Because the decoder remains unchanged, the coprocessor can operate\noffline and asynchronously, and the language model can function normally if the\ncoprocessor is unavailable or if a given cache is deemed not to require extra\ncomputation. We show experimentally that when a cache is augmented, the decoder\nachieves lower perplexity on numerous subsequent tokens. 
Furthermore, even\nwithout any task-specific training, our experiments demonstrate that cache\naugmentation consistently reduces perplexity and improves performance across a\nrange of reasoning-intensive tasks.\n","authors":["Luyang Liu","Jonas Pfeiffer","Jiaxing Wu","Jun Xie","Arthur Szlam"],"pdf_url":"https://arxiv.org/pdf/2412.17747v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.15241v2","updated":"2024-12-23T17:59:23Z","published":"2024-12-13T09:52:25Z","title":"Quantifying Positional Biases in Text Embedding Models","summary":" Embedding models are crucial for tasks in Information Retrieval (IR) and\nsemantic similarity measurement, yet their handling of longer texts and\nassociated positional biases remains underexplored. In this study, we\ninvestigate the impact of content position and input size on text embeddings.\nOur experiments reveal that embedding models, irrespective of their positional\nencoding mechanisms, disproportionately prioritize the beginning of an input.\nAblation studies demonstrate that insertion of irrelevant text or removal at\nthe start of a document reduces cosine similarity between altered and original\nembeddings by up to 12.3\\% more than ablations at the end. Regression analysis\nfurther confirms this bias, with sentence importance declining as position\nmoves further from the start, even with with content-agnosticity. We\nhypothesize that this effect arises from pre-processing strategies and chosen\npositional encoding techniques. These findings quantify the sensitivity of\nretrieval systems and suggest a new lens towards embedding model robustness.\n","authors":["Reagan J. 
Lee","Samarth Goel","Kannan Ramchandran"],"pdf_url":"https://arxiv.org/pdf/2412.15241v2.pdf","comment":"13 pages, 11 figures, NeurIPS"},{"id":"http://arxiv.org/abs/2409.03752v3","updated":"2024-12-23T17:57:29Z","published":"2024-09-05T17:59:12Z","title":"Attention Heads of Large Language Models: A Survey","summary":" Since the advent of ChatGPT, Large Language Models (LLMs) have excelled in\nvarious tasks but remain as black-box systems. Understanding the reasoning\nbottlenecks of LLMs has become a critical challenge, as these limitations are\ndeeply tied to their internal architecture. Among these, attention heads have\nemerged as a focal point for investigating the underlying mechanics of LLMs. In\nthis survey, we aim to demystify the internal reasoning processes of LLMs by\nsystematically exploring the roles and mechanisms of attention heads. We first\nintroduce a novel four-stage framework inspired by the human thought process:\nKnowledge Recalling, In-Context Identification, Latent Reasoning, and\nExpression Preparation. Using this framework, we comprehensively review\nexisting research to identify and categorize the functions of specific\nattention heads. Additionally, we analyze the experimental methodologies used\nto discover these special heads, dividing them into two categories:\nModeling-Free and Modeling-Required methods. We further summarize relevant\nevaluation methods and benchmarks. 
Finally, we discuss the limitations of\ncurrent research and propose several potential future directions.\n","authors":["Zifan Zheng","Yezhaohui Wang","Yuxin Huang","Shichao Song","Mingchuan Yang","Bo Tang","Feiyu Xiong","Zhiyu Li"],"pdf_url":"https://arxiv.org/pdf/2409.03752v3.pdf","comment":"33 pages, 11 figures, 7 tables, 7 equations"},{"id":"http://arxiv.org/abs/2412.17744v1","updated":"2024-12-23T17:52:10Z","published":"2024-12-23T17:52:10Z","title":"RepoTransBench: A Real-World Benchmark for Repository-Level Code\n Translation","summary":" Repository-level code translation refers to translating an entire code\nrepository from one programming language to another while preserving the\nfunctionality of the source repository. Many benchmarks have been proposed to\nevaluate the performance of such code translators. However, previous benchmarks\nmostly provide fine-grained samples, focusing at either code snippet, function,\nor file-level code translation. Such benchmarks do not accurately reflect\nreal-world demands, where entire repositories often need to be translated,\ninvolving longer code length and more complex functionalities. To address this\ngap, we propose a new benchmark, named RepoTransBench, which is a real-world\nrepository-level code translation benchmark with an automatically executable\ntest suite. We conduct experiments on RepoTransBench to evaluate the\ntranslation performance of 11 advanced LLMs. We find that the Success@1 score\n(test success in one attempt) of the best-performing LLM is only 7.33%. To\nfurther explore the potential of LLMs for repository-level code translation, we\nprovide LLMs with error-related feedback to perform iterative debugging and\nobserve an average 7.09% improvement on Success@1. 
However, even with this\nimprovement, the Success@1 score of the best-performing LLM is only 21%, which\nmay not meet the need for reliable automatic repository-level code translation.\nFinally, we conduct a detailed error analysis and highlight current LLMs'\ndeficiencies in repository-level code translation, which could provide a\nreference for further improvements.\n","authors":["Yanli Wang","Yanlin Wang","Suiquan Wang","Daya Guo","Jiachi Chen","John Grundy","Xilin Liu","Yuchi Ma","Mingzhi Mao","Hongyu Zhang","Zibin Zheng"],"pdf_url":"https://arxiv.org/pdf/2412.17744v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.17739v1","updated":"2024-12-23T17:44:01Z","published":"2024-12-23T17:44:01Z","title":"Fourier Position Embedding: Enhancing Attention's Periodic Extension for\n Length Generalization","summary":" Extending the context length of Language Models (LMs) by improving Rotary\nPosition Embedding (RoPE) has become a trend. While existing works mainly\naddress RoPE's limitations within attention mechanism, this paper provides an\nanalysis across nearly all parts of LMs, uncovering their adverse effects on\nlength generalization for RoPE-based attention. Using Discrete Signal\nProcessing theory, we show that RoPE enables periodic attention by implicitly\nachieving Non-Uniform Discrete Fourier Transform. However, this periodicity is\nundermined by the spectral damage caused by: 1) linear layers and activation\nfunctions outside of attention; 2) insufficiently trained frequency components\nbrought by time-domain truncation. 
Building on our observations, we propose\nFourier Position Embedding (FoPE), which enhances attention's frequency-domain\nproperties to improve both its periodic extension and length generalization.\nFoPE constructs Fourier Series and zero-outs the destructive frequency\ncomponents, increasing model robustness against the spectrum damage.\nExperiments across various model scales show that, within varying context\nwindows, FoPE can maintain a more stable perplexity and a more consistent\naccuracy in a needle-in-haystack task compared to RoPE and ALiBi. Several\nanalyses and ablations bring further support to our method and theoretical\nmodeling.\n","authors":["Ermo Hua","Che Jiang","Xingtai Lv","Kaiyan Zhang","Ning Ding","Youbang Sun","Biqing Qi","Yuchen Fan","Xue Kai Zhu","Bowen Zhou"],"pdf_url":"https://arxiv.org/pdf/2412.17739v1.pdf","comment":"14 pages, 7 figures"},{"id":"http://arxiv.org/abs/2407.13690v2","updated":"2024-12-23T17:32:21Z","published":"2024-06-18T07:14:02Z","title":"DART-Math: Difficulty-Aware Rejection Tuning for Mathematical\n Problem-Solving","summary":" Solving mathematical problems requires advanced reasoning abilities and\npresents notable challenges for large language models. Previous works usually\nsynthesize data from proprietary models to augment existing datasets, followed\nby instruction tuning to achieve top-tier results. However, our analysis of\nthese datasets reveals severe biases towards easy queries, with frequent\nfailures to generate any correct response for the most challenging queries.\nHypothesizing that difficult queries are crucial to learn complex reasoning, we\npropose Difficulty-Aware Rejection Tuning (DART), a method that allocates\ndifficult queries more trials during the synthesis phase, enabling more\nextensive training on difficult samples. Utilizing DART, we have created new\ndatasets for mathematical problem-solving that focus more on difficult queries\nand are substantially smaller than previous ones. 
Remarkably, our synthesis\nprocess solely relies on a 7B-sized open-weight model, without reliance on the\ncommonly used proprietary GPT-4. We fine-tune various base models on our\ndatasets ranging from 7B to 70B in size, resulting in a series of strong models\ncalled DART-MATH. In comprehensive in-domain and out-of-domain evaluation on 6\nmathematical benchmarks, DART-MATH outperforms vanilla rejection tuning\nsignificantly, being superior or comparable to previous arts, despite using\nmuch smaller datasets and no proprietary models. Furthermore, our results\nposition our synthetic datasets as the most effective and cost-efficient\npublicly available resources for advancing mathematical problem-solving.\n","authors":["Yuxuan Tong","Xiwen Zhang","Rui Wang","Ruidong Wu","Junxian He"],"pdf_url":"https://arxiv.org/pdf/2407.13690v2.pdf","comment":"NeurIPS 2024. Data and model checkpoints are available at\n https://github.com/hkust-nlp/dart-math"},{"id":"http://arxiv.org/abs/2412.17729v1","updated":"2024-12-23T17:19:58Z","published":"2024-12-23T17:19:58Z","title":"Chumor 2.0: Towards Benchmarking Chinese Humor Understanding","summary":" Existing humor datasets and evaluations predominantly focus on English,\nleaving limited resources for culturally nuanced humor in non-English languages\nlike Chinese. To address this gap, we construct Chumor, the first Chinese humor\nexplanation dataset that exceeds the size of existing humor datasets. Chumor is\nsourced from Ruo Zhi Ba, a Chinese Reddit-like platform known for sharing\nintellectually challenging and culturally specific jokes. We test ten LLMs\nthrough direct and chain-of-thought prompting, revealing that Chumor poses\nsignificant challenges to existing LLMs, with their accuracy slightly above\nrandom and far below human. In addition, our analysis highlights that\nhuman-annotated humor explanations are significantly better than those\ngenerated by GPT-4o and ERNIE-4-turbo. 
We release Chumor at\nhttps://huggingface.co/datasets/dnaihao/Chumor, our project page is at\nhttps://dnaihao.github.io/Chumor-dataset/, our leaderboard is at\nhttps://huggingface.co/spaces/dnaihao/Chumor, and our codebase is at\nhttps://github.com/dnaihao/Chumor-dataset.\n","authors":["Ruiqi He","Yushu He","Longju Bai","Jiarui Liu","Zhenjie Sun","Zenghao Tang","He Wang","Hanchen Xia","Rada Mihalcea","Naihao Deng"],"pdf_url":"https://arxiv.org/pdf/2412.17729v1.pdf","comment":"arXiv admin note: substantial text overlap with arXiv:2406.12754"},{"id":"http://arxiv.org/abs/2412.17727v1","updated":"2024-12-23T17:17:50Z","published":"2024-12-23T17:17:50Z","title":"Knowledge Editing through Chain-of-Thought","summary":" Large Language Models (LLMs) have demonstrated exceptional capabilities\nacross a wide range of natural language processing (NLP) tasks. However,\nkeeping these models up-to-date with evolving world knowledge remains a\nsignificant challenge due to the high costs of frequent retraining. To address\nthis challenge, knowledge editing techniques have emerged to update LLMs with\nnew information without rebuilding the model from scratch. Among these, the\nin-context editing paradigm stands out for its effectiveness in integrating new\nknowledge while preserving the model's original capabilities. Despite its\npotential, existing in-context knowledge editing methods are often\ntask-specific, focusing primarily on multi-hop QA tasks using structured\nknowledge triples. Moreover, their reliance on few-shot prompting for task\ndecomposition makes them unstable and less effective in generalizing across\ndiverse tasks.\n In response to these limitations, we propose EditCoT, a novel knowledge\nediting framework that flexibly and efficiently updates LLMs across various\ntasks without retraining. EditCoT works by generating a chain-of-thought (CoT)\nfor a given input and then iteratively refining this CoT process using a CoT\neditor based on updated knowledge. 
We evaluate EditCoT across a diverse range\nof benchmarks, covering multiple languages and tasks. The results demonstrate\nthat our approach achieves state-of-the-art performance while offering superior\ngeneralization, effectiveness, and stability compared to existing methods,\nmarking a significant advancement in the field of knowledge updating. Code and\ndata are available at: https://github.com/bebr2/EditCoT.\n","authors":["Changyue Wang","Weihang Su","Qingyao Ai","Yiqun Liu"],"pdf_url":"https://arxiv.org/pdf/2412.17727v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2404.02103v2","updated":"2024-12-23T17:01:11Z","published":"2024-04-02T17:00:11Z","title":"CLAPNQ: Cohesive Long-form Answers from Passages in Natural Questions\n for RAG systems","summary":" Retrieval Augmented Generation (RAG) has become a popular application for\nlarge language models. It is preferable that successful RAG systems provide\naccurate answers that are supported by being grounded in a passage without any\nhallucinations. While considerable work is required for building a full RAG\npipeline, being able to benchmark performance is also necessary. We present\nClapNQ, a benchmark Long-form Question Answering dataset for the full RAG\npipeline. ClapNQ includes long answers with grounded gold passages from Natural\nQuestions (NQ) and a corpus to perform either retrieval, generation, or the\nfull RAG pipeline. The ClapNQ answers are concise, 3x smaller than the full\npassage, and cohesive, meaning that the answer is composed fluently, often by\nintegrating multiple pieces of the passage that are not contiguous. RAG models\nmust adapt to these properties to be successful at ClapNQ. We present baseline\nexperiments and analysis for ClapNQ that highlight areas where there is still\nsignificant room for improvement in grounded RAG. 
CLAPNQ is publicly available\nat https://github.com/primeqa/clapnq\n","authors":["Sara Rosenthal","Avirup Sil","Radu Florian","Salim Roukos"],"pdf_url":"https://arxiv.org/pdf/2404.02103v2.pdf","comment":"26 pages, Accepted at TACL"},{"id":"http://arxiv.org/abs/2412.17696v1","updated":"2024-12-23T16:23:13Z","published":"2024-12-23T16:23:13Z","title":"Understanding the Logic of Direct Preference Alignment through Logic","summary":" Recent direct preference alignment algorithms (DPA), such as DPO, have shown\ngreat promise in aligning large language models to human preferences. While\nthis has motivated the development of many new variants of the original DPO\nloss, understanding the differences between these recent proposals, as well as\ndeveloping new DPA loss functions, remains difficult given the lack of a\ntechnical and conceptual framework for reasoning about the underlying semantics\nof these algorithms. In this paper, we attempt to remedy this by formalizing\nDPA losses in terms of discrete reasoning problems. Specifically, we ask: Given\nan existing DPA loss, can we systematically derive a symbolic expression that\ncharacterizes its semantics? How do the semantics of two losses relate to each\nother? We propose a novel formalism for characterizing preference losses for\nsingle model and reference model based approaches, and identify symbolic forms\nfor a number of commonly used DPA variants. Further, we show how this formal\nview of preference learning sheds new light on both the size and structure of\nthe DPA loss landscape, making it possible to not only rigorously characterize\nthe relationships between recent loss proposals but also to systematically\nexplore the landscape and derive new loss functions from first principles. 
We\nhope our framework and findings will help provide useful guidance to those\nworking on human AI alignment.\n","authors":["Kyle Richardson","Vivek Srikumar","Ashish Sabharwal"],"pdf_url":"https://arxiv.org/pdf/2412.17696v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.10571v3","updated":"2024-12-23T16:12:59Z","published":"2024-12-13T21:28:17Z","title":"Evidence Contextualization and Counterfactual Attribution for\n Conversational QA over Heterogeneous Data with RAG Systems","summary":" Retrieval Augmented Generation (RAG) works as a backbone for interacting with\nan enterprise's own data via Conversational Question Answering (ConvQA). In a\nRAG system, a retriever fetches passages from a collection in response to a\nquestion, which are then included in the prompt of a large language model (LLM)\nfor generating a natural language (NL) answer. However, several RAG systems\ntoday suffer from two shortcomings: (i) retrieved passages usually contain\ntheir raw text and lack appropriate document context, negatively impacting both\nretrieval and answering quality; and (ii) attribution strategies that explain\nanswer generation typically rely only on similarity between the answer and the\nretrieved passages, thereby only generating plausible but not causal\nexplanations. In this work, we demonstrate RAGONITE, a RAG system that remedies\nthe above concerns by: (i) contextualizing evidence with source metadata and\nsurrounding text; and (ii) computing counterfactual attribution, a causal\nexplanation approach where the contribution of an evidence to an answer is\ndetermined by the similarity of the original response to the answer obtained by\nremoving that evidence. To evaluate our proposals, we release a new benchmark\nConfQuestions: it has 300 hand-created conversational questions, each in\nEnglish and German, coupled with ground truth URLs, completed questions, and\nanswers from 215 public Confluence pages. 
These documents are typical of\nenterprise wiki spaces with heterogeneous elements. Experiments with RAGONITE\non ConfQuestions show the viability of our ideas: contextualization improves\nRAG performance, and counterfactual explanations outperform standard\nattribution.\n","authors":["Rishiraj Saha Roy","Joel Schlotthauer","Chris Hinze","Andreas Foltyn","Luzian Hahn","Fabian Kuech"],"pdf_url":"https://arxiv.org/pdf/2412.10571v3.pdf","comment":"Accepted at WSDM 2025, 8 pages"},{"id":"http://arxiv.org/abs/2412.17686v1","updated":"2024-12-23T16:11:27Z","published":"2024-12-23T16:11:27Z","title":"Large Language Model Safety: A Holistic Survey","summary":" The rapid development and deployment of large language models (LLMs) have\nintroduced a new frontier in artificial intelligence, marked by unprecedented\ncapabilities in natural language understanding and generation. However, the\nincreasing integration of these models into critical applications raises\nsubstantial safety concerns, necessitating a thorough examination of their\npotential risks and associated mitigation strategies.\n This survey provides a comprehensive overview of the current landscape of LLM\nsafety, covering four major categories: value misalignment, robustness to\nadversarial attacks, misuse, and autonomous AI risks. 
In addition to the\ncomprehensive review of the mitigation methodologies and evaluation resources\non these four aspects, we further explore four topics related to LLM safety:\nthe safety implications of LLM agents, the role of interpretability in\nenhancing LLM safety, the technology roadmaps proposed and abided by a list of\nAI companies and institutes for LLM safety, and AI governance aimed at LLM\nsafety with discussions on international cooperation, policy proposals, and\nprospective regulatory directions.\n Our findings underscore the necessity for a proactive, multifaceted approach\nto LLM safety, emphasizing the integration of technical solutions, ethical\nconsiderations, and robust governance frameworks. This survey is intended to\nserve as a foundational resource for academy researchers, industry\npractitioners, and policymakers, offering insights into the challenges and\nopportunities associated with the safe integration of LLMs into society.\nUltimately, it seeks to contribute to the safe and beneficial development of\nLLMs, aligning with the overarching goal of harnessing AI for societal\nadvancement and well-being. A curated list of related papers has been publicly\navailable at https://github.com/tjunlp-lab/Awesome-LLM-Safety-Papers.\n","authors":["Dan Shi","Tianhao Shen","Yufei Huang","Zhigen Li","Yongqi Leng","Renren Jin","Chuang Liu","Xinwei Wu","Zishan Guo","Linhao Yu","Ling Shi","Bojian Jiang","Deyi Xiong"],"pdf_url":"https://arxiv.org/pdf/2412.17686v1.pdf","comment":"158 pages, 18 figures"},{"id":"http://arxiv.org/abs/2412.17669v1","updated":"2024-12-23T15:54:15Z","published":"2024-12-23T15:54:15Z","title":"Generating Completions for Fragmented Broca's Aphasic Sentences Using\n Large Language Models","summary":" Broca's aphasia is a type of aphasia characterized by non-fluent, effortful\nand fragmented speech production with relatively good comprehension. 
Since\ntraditional aphasia treatment methods are often time-consuming,\nlabour-intensive, and do not reflect real-world conversations, applying natural\nlanguage processing based approaches such as Large Language Models (LLMs) could\npotentially contribute to improving existing treatment approaches. To address\nthis issue, we explore the use of sequence-to-sequence LLMs for completing\nfragmented Broca's aphasic sentences. We first generate synthetic Broca's\naphasic data using a rule-based system designed to mirror the linguistic\ncharacteristics of Broca's aphasic speech. Using this synthetic data, we then\nfine-tune four pre-trained LLMs on the task of completing fragmented sentences.\nWe evaluate our fine-tuned models on both synthetic and authentic Broca's\naphasic data. We demonstrate LLMs' capability for reconstructing fragmented\nsentences, with the models showing improved performance with longer input\nutterances. Our result highlights the LLMs' potential in advancing\ncommunication aids for individuals with Broca's aphasia and possibly other\nclinical populations.\n","authors":["Sijbren van Vaals","Yevgen Matusevych","Frank Tsiwah"],"pdf_url":"https://arxiv.org/pdf/2412.17669v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.17891v1","updated":"2024-12-23T15:49:43Z","published":"2024-12-23T15:49:43Z","title":"The Power of Adaptation: Boosting In-Context Learning through Adaptive\n Prompting","summary":" Large Language Models (LLMs) have demonstrated exceptional abilities across a\nbroad range of language-related tasks, including generating solutions to\ncomplex reasoning problems. An effective technique to enhance LLM performance\nis in-context learning, which encourages a step-by-step reasoning process by\nincluding explanatory examples to guide the model's responses. 
However,\nselecting appropriate exemplars for the model poses a challenge, as each\ndataset demands a distinct set of exemplars to enable the LLM to learn\neffectively and perform well on the test set. Current studies often rely on\nuncertainty- or diversity-based selection strategies to select exemplars for\nannotation and to improve model learning. However, these studies typically\nemploy a non-adaptive approach, selecting a set of exemplars all at once. We\nargue that this non-adaptive strategy may result in a set of exemplars with\nhigh redundancy in terms of the knowledge covered, ultimately reducing their\noverall informativeness. To address this limitation, we propose\n\\textsc{Adaptive-Prompt}, a novel method that adaptively selects exemplars by\nleveraging model feedback from previously chosen exemplars. Experimental\nresults show that \\textsc{Adaptive-Prompt} significantly enhances LLM\nperformance across a variety of reasoning tasks.\n","authors":["Shuzhang Cai","Twumasi Mensah-Boateng","Xander Kuksov","Jing Yuan","Shaojie Tang"],"pdf_url":"https://arxiv.org/pdf/2412.17891v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.11745v2","updated":"2024-12-23T15:36:32Z","published":"2024-08-21T16:11:59Z","title":"FocusLLM: Precise Understanding of Long Context by Dynamic Condensing","summary":" Empowering LLMs with the ability to precisely understand long contexts is\ncrucial for many downstream applications. However, handling long contexts with\nconventional transformer architecture requires substantial training and\ninference resources. Existing context condensing methods cannot accurately\nunderstand the full context, as there is a considerable amount of information\nloss in the condensing process. 
To address these issues, we present FocusLLM, a\nframework designed to extend the fixed context length of any decoder-only LLM,\nallowing the model to focus on relevant information from very long sequences.\nFocusLLM first divides long text input into chunks based on the model's\noriginal context length. It then employs the dynamic condensing process to\ndistill crucial information from each chunk. Ultimately, through the novel\nparallel decoding mechanism, FocusLLM can integrate the extracted information\ninto its local context. FocusLLM stands out for great training efficiency and\nversatility: trained with an 8K input length and with much less training cost\nthan previous methods, FocusLLM exhibits superior performance across downstream\ntasks and maintains strong language modeling ability when handling extensive\nlong texts, even up to 400K tokens. Our code is available at\nhttps://github.com/leezythu/FocusLLM.\n","authors":["Zhenyu Li","Yike Zhang","Tengyu Pan","Yutao Sun","Zhichao Duan","Junjie Fang","Rong Han","Zixuan Wang","Jianyong Wang"],"pdf_url":"https://arxiv.org/pdf/2408.11745v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.17626v1","updated":"2024-12-23T14:58:37Z","published":"2024-12-23T14:58:37Z","title":"Tracking the Feature Dynamics in LLM Training: A Mechanistic Study","summary":" Understanding training dynamics and feature evolution is crucial for the\nmechanistic interpretability of large language models (LLMs). Although sparse\nautoencoders (SAEs) have been used to identify features within LLMs, a clear\npicture of how these features evolve during training remains elusive. In this\nstudy, we: (1) introduce SAE-Track, a method to efficiently obtain a continual\nseries of SAEs; (2) formulate the process of feature formation and conduct a\nmechanistic analysis; and (3) analyze and visualize feature drift during\ntraining. 
Our work provides new insights into the dynamics of features in LLMs,\nenhancing our understanding of training mechanisms and feature evolution.\n","authors":["Yang Xu","Yi Wang","Hao Wang"],"pdf_url":"https://arxiv.org/pdf/2412.17626v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.08274v3","updated":"2024-12-23T14:32:28Z","published":"2024-12-11T10:46:21Z","title":"2M-BELEBELE: Highly Multilingual Speech and American Sign Language\n Comprehension Dataset","summary":" We introduce the first highly multilingual speech and American Sign Language\n(ASL) comprehension dataset by extending BELEBELE. Our dataset covers 74 spoken\nlanguages at the intersection of BELEBELE and FLEURS, and one sign language\n(ASL). We evaluate the 2M-BELEBELE dataset in both 5-shot and zero-shot settings;\nacross languages, speech comprehension accuracy is on average about 2-3% lower\nthan reading comprehension accuracy.\n","authors":["Marta R. Costa-jussà","Bokai Yu","Pierre Andrews","Belen Alastruey","Necati Cihan Camgoz","Joe Chuang","Jean Maillard","Christophe Ropers","Arina Turkantenko","Carleigh Wood"],"pdf_url":"https://arxiv.org/pdf/2412.08274v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.17596v1","updated":"2024-12-23T14:13:44Z","published":"2024-12-23T14:13:44Z","title":"LiveIdeaBench: Evaluating LLMs' Scientific Creativity and Idea\n Generation with Minimal Context","summary":" While Large Language Models (LLMs) have demonstrated remarkable capabilities\nin scientific tasks, existing evaluation frameworks primarily assess their\nperformance using rich contextual inputs, overlooking their ability to generate\nnovel ideas from minimal information. We introduce LiveIdeaBench, a\ncomprehensive benchmark that evaluates LLMs' scientific creativity and\ndivergent thinking capabilities using single-keyword prompts. 
Drawing from\nGuilford's creativity theory, our framework employs a dynamic panel of\nstate-of-the-art LLMs to assess generated ideas across four key dimensions:\noriginality, feasibility, fluency, and flexibility. Through extensive\nexperimentation with 20 leading models across 1,180 keywords spanning 18\nscientific domains, we reveal that scientific creative ability shows distinct\npatterns from general intelligence metrics. Notably, our results demonstrate\nthat models like QwQ-32B-preview achieve comparable creative performance to\ntop-tier models like o1-preview, despite significant gaps in their general\nintelligence scores. These findings highlight the importance of specialized\nevaluation frameworks for scientific creativity and suggest that the\ndevelopment of creative capabilities in LLMs may follow different trajectories\nthan traditional problem-solving abilities.\n","authors":["Kai Ruan","Xuan Wang","Jixiang Hong","Hao Sun"],"pdf_url":"https://arxiv.org/pdf/2412.17596v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2406.13945v2","updated":"2024-12-23T14:10:09Z","published":"2024-06-20T02:25:07Z","title":"CityBench: Evaluating the Capabilities of Large Language Models for\n Urban Tasks","summary":" Recently, large language models (LLMs) with extensive general knowledge and\npowerful reasoning abilities have seen rapid development and widespread\napplication. A systematic and reliable evaluation of LLMs or vision-language\nmodel (VLMs) is a crucial step in applying and developing them for various\nfields. There have been some early explorations about the usability of LLMs for\nlimited urban tasks, but a systematic and scalable evaluation benchmark is\nstill lacking. 
The challenge in constructing a systematic evaluation benchmark\nfor urban research lies in the diversity of urban data, the complexity of\napplication scenarios and the highly dynamic nature of the urban environment.\nIn this paper, we design CityBench, an interactive simulator based evaluation\nplatform, as the first systematic benchmark for evaluating the capabilities of\nLLMs for diverse tasks in urban research. First, we build CityData to integrate\nthe diverse urban data and CitySimu to simulate fine-grained urban dynamics.\nBased on CityData and CitySimu, we design 8 representative urban tasks in 2\ncategories of perception-understanding and decision-making as the CityBench.\nWith extensive results from 30 well-known LLMs and VLMs in 13 cities around the\nworld, we find that advanced LLMs and VLMs can achieve competitive performance\nin diverse urban tasks requiring commonsense and semantic understanding\nabilities, e.g., understanding the human dynamics and semantic inference of\nurban images. Meanwhile, they fail to solve the challenging urban tasks\nrequiring professional knowledge and high-level reasoning abilities, e.g.,\ngeospatial prediction and traffic control task. These observations provide\nvaluable perspectives for utilizing and developing LLMs in the future. 
The code\nis openly accessible via https://github.com/tsinghua-fib-lab/CityBench.\n","authors":["Jie Feng","Jun Zhang","Tianhui Liu","Xin Zhang","Tianjian Ouyang","Junbo Yan","Yuwei Du","Siqi Guo","Yong Li"],"pdf_url":"https://arxiv.org/pdf/2406.13945v2.pdf","comment":"26 pages, https://github.com/tsinghua-fib-lab/CityBench"},{"id":"http://arxiv.org/abs/2412.17592v1","updated":"2024-12-23T14:08:45Z","published":"2024-12-23T14:08:45Z","title":"Investigating Length Issues in Document-level Machine Translation","summary":" Transformer architectures are increasingly effective at processing and\ngenerating very long chunks of texts, opening new perspectives for\ndocument-level machine translation (MT). In this work, we challenge the ability\nof MT systems to handle texts comprising up to several thousands of tokens. We\ndesign and implement a new approach to precisely measure the effect of\nlength increments on MT outputs. Our experiments with two representative\narchitectures unambiguously show that (a) translation performance decreases\nwith the length of the input text; (b) the position of sentences within the\ndocument matters and translation quality is higher for sentences occurring\nearlier in a document. We further show that manipulating the distribution of\ndocument lengths and of positional embeddings only marginally mitigates such\nproblems. 
Our results suggest that even though document-level MT is\ncomputationally feasible, it does not yet match the performance of\nsentence-based MT.\n","authors":["Ziqian Peng","Rachel Bawden","François Yvon"],"pdf_url":"https://arxiv.org/pdf/2412.17592v1.pdf","comment":"Under review"},{"id":"http://arxiv.org/abs/2410.06846v2","updated":"2024-12-23T13:53:32Z","published":"2024-10-09T13:06:43Z","title":"Joint Fine-tuning and Conversion of Pretrained Speech and Language\n Models towards Linear Complexity","summary":" Architectures such as Linformer and Mamba have recently emerged as\ncompetitive linear time replacements for transformers. However, corresponding\nlarge pretrained models are often unavailable, especially in non-text domains.\nTo remedy this, we present a Cross-Architecture Layerwise Distillation (CALD)\napproach that jointly converts a transformer model to a linear time substitute\nand fine-tunes it to a target task. We also compare several means to guide the\nfine-tuning to optimally retain the desired inference capability from the\noriginal model. The methods differ in their use of the target model and the\ntrajectory of the parameters. In a series of empirical studies on language\nprocessing, language modeling, and speech processing, we show that CALD can\neffectively recover the result of the original model, and that the guiding\nstrategy contributes to the result. Some reasons for the variation are\nsuggested.\n","authors":["Mutian He","Philip N. Garner"],"pdf_url":"https://arxiv.org/pdf/2410.06846v2.pdf","comment":"17 pages, 5 figures"},{"id":"http://arxiv.org/abs/2412.17562v1","updated":"2024-12-23T13:33:09Z","published":"2024-12-23T13:33:09Z","title":"ERUPD -- English to Roman Urdu Parallel Dataset","summary":" Bridging linguistic gaps fosters global growth and cultural exchange. 
This\nstudy addresses the challenges of Roman Urdu -- a Latin-script adaptation of\nUrdu widely used in digital communication -- by creating a novel parallel\ndataset comprising 75,146 sentence pairs. Roman Urdu's lack of standardization,\nphonetic variability, and code-switching with English complicates language\nprocessing. We tackled this by employing a hybrid approach that combines\nsynthetic data generated via advanced prompt engineering with real-world\nconversational data from personal messaging groups. We further refined the\ndataset through a human evaluation phase, addressing linguistic inconsistencies\nand ensuring accuracy in code-switching, phonetic representations, and synonym\nvariability. The resulting dataset captures Roman Urdu's diverse linguistic\nfeatures and serves as a critical resource for machine translation, sentiment\nanalysis, and multilingual education.\n","authors":["Mohammed Furqan","Raahid Bin Khaja","Rayyan Habeeb"],"pdf_url":"https://arxiv.org/pdf/2412.17562v1.pdf","comment":"9 pages, 1 figure"},{"id":"http://arxiv.org/abs/2412.17558v1","updated":"2024-12-23T13:26:04Z","published":"2024-12-23T13:26:04Z","title":"A Survey of Query Optimization in Large Language Models","summary":" \\textit{Query Optimization} (QO) refers to techniques aimed at enhancing the\nefficiency and quality of Large Language Models (LLMs) in understanding and\nanswering queries, especially complex ones in scenarios like\nRetrieval-Augmented Generation (RAG). Specifically, RAG mitigates the\nlimitations of LLMs by dynamically retrieving and leveraging up-to-date\nrelevant information, which provides a cost-effective solution to the challenge\nof LLMs producing plausible but potentially inaccurate responses. 
Recently, as\nRAG evolves and incorporates multiple components that influence its\nperformance, QO has emerged as a critical element, playing a pivotal role in\ndetermining the effectiveness of RAG's retrieval stage in accurately sourcing\nthe necessary multiple pieces of evidence to answer queries correctly. In this\npaper, we trace the evolution of QO techniques by summarizing and analyzing\nsignificant studies. Through an organized framework and categorization, we aim\nto consolidate existing QO techniques in RAG, elucidate their technological\nfoundations, and highlight their potential to enhance the versatility and\napplications of LLMs.\n","authors":["Mingyang Song","Mao Zheng"],"pdf_url":"https://arxiv.org/pdf/2412.17558v1.pdf","comment":"Ongoing Work"},{"id":"http://arxiv.org/abs/2412.17552v1","updated":"2024-12-23T13:20:06Z","published":"2024-12-23T13:20:06Z","title":"Comparative Analysis of Document-Level Embedding Methods for Similarity\n Scoring on Shakespeare Sonnets and Taylor Swift Lyrics","summary":" This study evaluates the performance of TF-IDF weighting, averaged Word2Vec\nembeddings, and BERT embeddings for document similarity scoring across two\ncontrasting textual domains. By analysing cosine similarity scores, the\nmethods' strengths and limitations are highlighted. The findings underscore\nTF-IDF's reliance on lexical overlap and Word2Vec's superior semantic\ngeneralisation, particularly in cross-domain comparisons. 
BERT demonstrates\nlower performance in challenging domains, likely due to insufficient\ndomain-specific fine-tuning.\n","authors":["Klara Kramer"],"pdf_url":"https://arxiv.org/pdf/2412.17552v1.pdf","comment":"9 pages, 4 figures"},{"id":"http://arxiv.org/abs/2412.17548v1","updated":"2024-12-23T13:08:48Z","published":"2024-12-23T13:08:48Z","title":"Resource-Aware Arabic LLM Creation: Model Adaptation, Integration, and\n Multi-Domain Testing","summary":" This paper presents a novel approach to fine-tuning the Qwen2-1.5B model for\nArabic language processing using Quantized Low-Rank Adaptation (QLoRA) on a\nsystem with only 4GB VRAM. We detail the process of adapting this large\nlanguage model to the Arabic domain, using diverse datasets including Bactrian,\nOpenAssistant, and Wikipedia Arabic corpora. Our methodology involves custom\ndata preprocessing, model configuration, and training optimization techniques\nsuch as gradient accumulation and mixed-precision training. We address specific\nchallenges in Arabic NLP, including morphological complexity, dialectal\nvariations, and diacritical mark handling. Experimental results over 10,000\ntraining steps show significant performance improvements, with the final loss\nconverging to 0.1083. We provide a comprehensive analysis of GPU memory usage,\ntraining dynamics, and model evaluation across various Arabic language tasks,\nincluding text classification, question answering, and dialect identification.\nThe fine-tuned model demonstrates robustness to input perturbations and\nimproved handling of Arabic-specific linguistic phenomena. This research\ncontributes to multilingual AI by demonstrating a resource-efficient approach\nfor creating specialized language models, potentially democratizing access to\nadvanced NLP technologies for diverse linguistic communities. 
Our work paves\nthe way for future research in low-resource language adaptation and efficient\nfine-tuning of large language models.\n","authors":["Prakash Aryan"],"pdf_url":"https://arxiv.org/pdf/2412.17548v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.17537v1","updated":"2024-12-23T12:59:43Z","published":"2024-12-23T12:59:43Z","title":"Domain adapted machine translation: What does catastrophic forgetting\n forget and why?","summary":" Neural Machine Translation (NMT) models can be specialized by domain\nadaptation, often involving fine-tuning on a dataset of interest. This process\nrisks catastrophic forgetting: rapid loss of generic translation quality.\nForgetting has been widely observed, with many mitigation methods proposed.\nHowever, the causes of forgetting and the relationship between forgetting and\nadaptation data are under-explored.\n This paper takes a novel approach to understanding catastrophic forgetting\nduring NMT adaptation by investigating the impact of the data. We provide a\nfirst investigation of what is forgotten, and why. We examine the relationship\nbetween forgetting and the in-domain data, and show that the amount and type of\nforgetting is linked to that data's target vocabulary coverage. Our findings\npave the way toward better informed NMT domain adaptation.\n","authors":["Danielle Saunders","Steve DeNeefe"],"pdf_url":"https://arxiv.org/pdf/2412.17537v1.pdf","comment":"EMNLP 2024"},{"id":"http://arxiv.org/abs/2412.17534v1","updated":"2024-12-23T12:58:30Z","published":"2024-12-23T12:58:30Z","title":"CiteBART: Learning to Generate Citations for Local Citation\n Recommendation","summary":" Citations are essential building blocks in scientific writing. The scientific\ncommunity is longing for support in their generation. Citation generation\ninvolves two complementary subtasks: Determining the citation worthiness of a\ncontext and, if it's worth it, proposing the best candidate papers for the\ncitation placeholder. 
The latter subtask is called local citation\nrecommendation (LCR). This paper proposes CiteBART, a custom BART pre-training\nbased on citation token masking to generate citations to achieve LCR. In the\nbase scheme, we mask the citation token in the local citation context to make\nthe citation prediction. In the global one, we concatenate the citing paper's\ntitle and abstract to the local citation context to learn to reconstruct the\ncitation token. CiteBART outperforms state-of-the-art approaches on the\ncitation recommendation benchmarks except for the smallest FullTextPeerRead\ndataset. The effect is significant in the larger benchmarks, e.g., Refseer and\nArXiv. We present a qualitative analysis and an ablation study to provide\ninsights into the workings of CiteBART. Our analyses confirm that its\ngenerative nature brings about a zero-shot capability.\n","authors":["Ege Yiğit Çelik","Selma Tekir"],"pdf_url":"https://arxiv.org/pdf/2412.17534v1.pdf","comment":"15 pages, 2 figures, 7 tables"},{"id":"http://arxiv.org/abs/2412.17533v1","updated":"2024-12-23T12:58:18Z","published":"2024-12-23T12:58:18Z","title":"Behind Closed Words: Creating and Investigating the forePLay Annotated\n Dataset for Polish Erotic Discourse","summary":" The surge in online content has created an urgent demand for robust detection\nsystems, especially in non-English contexts where current tools demonstrate\nsignificant limitations. We present forePLay, a novel Polish language dataset\nfor erotic content detection, featuring over 24k annotated sentences with a\nmultidimensional taxonomy encompassing ambiguity, violence, and social\nunacceptability dimensions. Our comprehensive evaluation demonstrates that\nspecialized Polish language models achieve superior performance compared to\nmultilingual alternatives, with transformer-based architectures showing\nparticular strength in handling imbalanced categories. 
The dataset and\naccompanying analysis establish essential frameworks for developing\nlinguistically-aware content moderation systems, while highlighting critical\nconsiderations for extending such capabilities to morphologically complex\nlanguages.\n","authors":["Anna Kołos","Katarzyna Lorenc","Emilia Wiśnios","Agnieszka Karlińska"],"pdf_url":"https://arxiv.org/pdf/2412.17533v1.pdf","comment":"The forePLay dataset and associated resources will be made publicly\n available for research purposes upon publication, in accordance with data\n sharing regulations"},{"id":"http://arxiv.org/abs/2411.18279v5","updated":"2024-12-23T12:48:43Z","published":"2024-11-27T12:13:39Z","title":"Large Language Model-Brained GUI Agents: A Survey","summary":" GUIs have long been central to human-computer interaction, providing an\nintuitive and visually-driven way to access and interact with digital systems.\nThe advent of LLMs, particularly multimodal models, has ushered in a new era of\nGUI automation. They have demonstrated exceptional capabilities in natural\nlanguage understanding, code generation, and visual processing. This has paved\nthe way for a new generation of LLM-brained GUI agents capable of interpreting\ncomplex GUI elements and autonomously executing actions based on natural\nlanguage instructions. These agents represent a paradigm shift, enabling users\nto perform intricate, multi-step tasks through simple conversational commands.\nTheir applications span across web navigation, mobile app interactions, and\ndesktop automation, offering a transformative user experience that\nrevolutionizes how individuals interact with software. This emerging field is\nrapidly advancing, with significant progress in both research and industry.\n To provide a structured understanding of this trend, this paper presents a\ncomprehensive survey of LLM-brained GUI agents, exploring their historical\nevolution, core components, and advanced techniques. 
We address research\nquestions such as existing GUI agent frameworks, the collection and utilization\nof data for training specialized GUI agents, the development of large action\nmodels tailored for GUI tasks, and the evaluation metrics and benchmarks\nnecessary to assess their effectiveness. Additionally, we examine emerging\napplications powered by these agents. Through a detailed analysis, this survey\nidentifies key research gaps and outlines a roadmap for future advancements in\nthe field. By consolidating foundational knowledge and state-of-the-art\ndevelopments, this work aims to guide both researchers and practitioners in\novercoming challenges and unlocking the full potential of LLM-brained GUI\nagents.\n","authors":["Chaoyun Zhang","Shilin He","Jiaxu Qian","Bowen Li","Liqun Li","Si Qin","Yu Kang","Minghua Ma","Guyue Liu","Qingwei Lin","Saravan Rajmohan","Dongmei Zhang","Qi Zhang"],"pdf_url":"https://arxiv.org/pdf/2411.18279v5.pdf","comment":"The collection of papers reviewed in this survey will be hosted and\n regularly updated on the GitHub repository:\n https://github.com/vyokky/LLM-Brained-GUI-Agents-Survey Additionally, a\n searchable webpage is available at https://aka.ms/gui-agent for easier access\n and exploration"},{"id":"http://arxiv.org/abs/2412.17522v1","updated":"2024-12-23T12:44:54Z","published":"2024-12-23T12:44:54Z","title":"DiffusionAttacker: Diffusion-Driven Prompt Manipulation for LLM\n Jailbreak","summary":" Large Language Models (LLMs) are susceptible to generating harmful content\nwhen prompted with carefully crafted inputs, a vulnerability known as LLM\njailbreaking. As LLMs become more powerful, studying jailbreak methods is\ncritical to enhancing security and aligning models with human values.\nTraditionally, jailbreak techniques have relied on suffix addition or prompt\ntemplates, but these methods suffer from limited attack diversity. 
This paper\nintroduces DiffusionAttacker, an end-to-end generative approach for jailbreak\nrewriting inspired by diffusion models. Our method employs a\nsequence-to-sequence (seq2seq) text diffusion model as a generator,\nconditioning on the original prompt and guiding the denoising process with a\nnovel attack loss. Unlike previous approaches that use autoregressive LLMs to\ngenerate jailbreak prompts, which limit the modification of already generated\ntokens and restrict the rewriting space, DiffusionAttacker utilizes a seq2seq\ndiffusion model, allowing more flexible token modifications. This approach\npreserves the semantic content of the original prompt while producing harmful\ncontent. Additionally, we leverage the Gumbel-Softmax technique to make the\nsampling process from the diffusion model's output distribution differentiable,\neliminating the need for iterative token search. Extensive experiments on\nAdvbench and Harmbench demonstrate that DiffusionAttacker outperforms previous\nmethods across various evaluation metrics, including attack success rate (ASR),\nfluency, and diversity.\n","authors":["Hao Wang","Hao Li","Junda Zhu","Xinyuan Wang","Chengwei Pan","MinLie Huang","Lei Sha"],"pdf_url":"https://arxiv.org/pdf/2412.17522v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2407.21315v4","updated":"2024-12-23T12:35:12Z","published":"2024-07-31T03:53:14Z","title":"Beyond Silent Letters: Amplifying LLMs in Emotion Recognition with Vocal\n Nuances","summary":" Emotion recognition in speech is a challenging multimodal task that requires\nunderstanding both verbal content and vocal nuances. 
This paper introduces a\nnovel approach to emotion detection using Large Language Models (LLMs), which\nhave demonstrated exceptional capabilities in natural language understanding.\nTo overcome the inherent limitation of LLMs in processing audio inputs, we\npropose SpeechCueLLM, a method that translates speech characteristics into\nnatural language descriptions, allowing LLMs to perform multimodal emotion\nanalysis via text prompts without any architectural changes. Our method is\nminimal yet impactful, outperforming baseline models that require structural\nmodifications. We evaluate SpeechCueLLM on two datasets: IEMOCAP and MELD,\nshowing significant improvements in emotion recognition accuracy, particularly\nfor high-quality audio data. We also explore the effectiveness of various\nfeature representations and fine-tuning strategies for different LLMs. Our\nexperiments demonstrate that incorporating speech descriptions yields a more\nthan 2% increase in the average weighted F1 score on IEMOCAP (from 70.111% to\n72.596%).\n","authors":["Zehui Wu","Ziwei Gong","Lin Ai","Pengyuan Shi","Kaan Donbekci","Julia Hirschberg"],"pdf_url":"https://arxiv.org/pdf/2407.21315v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2406.14546v3","updated":"2024-12-23T12:01:28Z","published":"2024-06-20T17:55:04Z","title":"Connecting the Dots: LLMs can Infer and Verbalize Latent Structure from\n Disparate Training Data","summary":" One way to address safety risks from large language models (LLMs) is to\ncensor dangerous knowledge from their training data. While this removes the\nexplicit information, implicit information can remain scattered across various\ntraining documents. Could an LLM infer the censored knowledge by piecing\ntogether these implicit hints? 
As a step towards answering this question, we\nstudy inductive out-of-context reasoning (OOCR), a type of generalization in\nwhich LLMs infer latent information from evidence distributed across training\ndocuments and apply it to downstream tasks without in-context learning. Using a\nsuite of five tasks, we demonstrate that frontier LLMs can perform inductive\nOOCR. In one experiment we finetune an LLM on a corpus consisting only of\ndistances between an unknown city and other known cities. Remarkably, without\nin-context examples or Chain of Thought, the LLM can verbalize that the unknown\ncity is Paris and use this fact to answer downstream questions. Further\nexperiments show that LLMs trained only on individual coin flip outcomes can\nverbalize whether the coin is biased, and those trained only on pairs\n$(x,f(x))$ can articulate a definition of $f$ and compute inverses. While OOCR\nsucceeds in a range of cases, we also show that it is unreliable, particularly\nfor smaller LLMs learning complex structures. Overall, the ability of LLMs to\n\"connect the dots\" without explicit in-context learning poses a potential\nobstacle to monitoring and controlling the knowledge acquired by LLMs.\n","authors":["Johannes Treutlein","Dami Choi","Jan Betley","Samuel Marks","Cem Anil","Roger Grosse","Owain Evans"],"pdf_url":"https://arxiv.org/pdf/2406.14546v3.pdf","comment":"Accepted at NeurIPS 2024. 10 pages, 8 figures"},{"id":"http://arxiv.org/abs/2412.17498v1","updated":"2024-12-23T11:55:33Z","published":"2024-12-23T11:55:33Z","title":"DRT-o1: Optimized Deep Reasoning Translation via Long Chain-of-Thought","summary":" Recently, O1-like models have emerged as representative examples,\nillustrating the effectiveness of long chain-of-thought (CoT) in reasoning\ntasks such as math and coding tasks. 
In this paper, we introduce DRT-o1, an\nattempt to bring the success of long CoT to neural machine translation (MT).\nSpecifically, in view of the literature books that might involve similes and\nmetaphors, translating these texts to a target language is very difficult in\npractice due to cultural differences. In such cases, literal translation often\nfails to convey the intended meaning effectively. Even for professional human\ntranslators, considerable thought must be given to preserving semantics\nthroughout the translation process. To simulate LLMs' long thought ability in\nMT, we first mine sentences containing similes or metaphors from existing\nliterature books, and then develop a multi-agent framework to translate these\nsentences via long thought. In the multi-agent framework, a translator is used\nto iteratively translate the source sentence under the suggestions provided by\nan advisor. To ensure the effectiveness of the long thoughts, an evaluator is\nalso employed to judge whether the translation in the current round is better\nthan the previous one or not. In this manner, we collect tens of thousands of\nlong-thought MT data, which is used to train our DRT-o1. The experimental\nresults on literature translation demonstrate the effectiveness of the DRT-o1.\nUsing Qwen2.5-7B and Qwen2.5-14B as the backbones, the improvement brought by\nDRT-o1 achieves 7.33~8.26 BLEU and 1.66~3.36 CometScore. Besides, DRT-o1-7B can\noutperform QwQ-32B-Preview by 7.82 BLEU and 1.46 CometScore, showing its\neffectiveness. The project is available at https://github.com/krystalan/DRT-o1\n","authors":["Jiaan Wang","Fandong Meng","Yunlong Liang","Jie Zhou"],"pdf_url":"https://arxiv.org/pdf/2412.17498v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.17483v1","updated":"2024-12-23T11:24:04Z","published":"2024-12-23T11:24:04Z","title":"A Silver Bullet or a Compromise for Full Attention? 
A Comprehensive\n Study of Gist Token-based Context Compression","summary":" In this work, we provide a thorough investigation of gist-based context\ncompression methods to improve long-context processing in large language\nmodels. We focus on two key questions: (1) How well can these methods replace\nfull attention models? and (2) What potential failure patterns arise due to\ncompression? Through extensive experiments, we show that while gist-based\ncompression can achieve near-lossless performance on tasks like\nretrieval-augmented generation and long-document QA, it faces challenges in\ntasks like synthetic recall. Furthermore, we identify three key failure\npatterns: lost by the boundary, lost if surprise, and lost along the way. To\nmitigate these issues, we propose two effective strategies: fine-grained\nautoencoding, which enhances the reconstruction of original token information,\nand segment-wise token importance estimation, which adjusts optimization based\non token dependencies. Our work provides valuable insights into the\nunderstanding of gist token-based context compression and offers practical\nstrategies for improving compression capabilities.\n","authors":["Chenlong Deng","Zhisong Zhang","Kelong Mao","Shuaiyi Li","Xinting Huang","Dong Yu","Zhicheng Dou"],"pdf_url":"https://arxiv.org/pdf/2412.17483v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.17481v1","updated":"2024-12-23T11:11:51Z","published":"2024-12-23T11:11:51Z","title":"A Survey on Multi-Generative Agent System: Recent Advances and New\n Frontiers","summary":" Multi-generative agent systems (MGASs) have become a research hotspot since\nthe rise of large language models (LLMs). However, with the continuous influx\nof new related works, the existing reviews struggle to capture them\ncomprehensively. This paper presents a comprehensive survey of these studies.\nWe first discuss the definition of MGAS, a framework encompassing much of\nprevious work. 
We provide an overview of the various applications of MGAS in\n(i) solving complex tasks, (ii) simulating specific scenarios, and (iii)\nevaluating generative agents. Building on previous studies, we also highlight\nseveral challenges and propose future directions for research in this field.\n","authors":["Shuaihang Chen","Yuanxing Liu","Wei Han","Weinan Zhang","Ting Liu"],"pdf_url":"https://arxiv.org/pdf/2412.17481v1.pdf","comment":"13 pages, 1 figure"},{"id":"http://arxiv.org/abs/2412.15265v2","updated":"2024-12-23T11:06:56Z","published":"2024-12-17T03:03:44Z","title":"Chinese SafetyQA: A Safety Short-form Factuality Benchmark for Large\n Language Models","summary":" With the rapid advancement of Large Language Models (LLMs), significant\nsafety concerns have emerged. Fundamentally, the safety of large language\nmodels is closely linked to the accuracy, comprehensiveness, and clarity of\ntheir understanding of safety knowledge, particularly in domains such as law,\npolicy and ethics. This factuality ability is crucial in determining whether\nthese models can be deployed and applied safely and compliantly within specific\nregions. To address these challenges and better evaluate the factuality ability\nof LLMs to answer short questions, we introduce the Chinese SafetyQA benchmark.\nChinese SafetyQA has several properties (i.e., Chinese, Diverse, High-quality,\nStatic, Easy-to-evaluate, Safety-related, Harmless). 
Based on Chinese SafetyQA,\nwe perform a comprehensive evaluation on the factuality abilities of existing\nLLMs and analyze how these capabilities relate to LLM abilities, e.g., RAG\nability and robustness against attacks.\n","authors":["Yingshui Tan","Boren Zheng","Baihui Zheng","Kerui Cao","Huiyun Jing","Jincheng Wei","Jiaheng Liu","Yancheng He","Wenbo Su","Xiangyong Zhu","Bo Zheng","Kaifu Zhang"],"pdf_url":"https://arxiv.org/pdf/2412.15265v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2410.18057v2","updated":"2024-12-23T10:48:15Z","published":"2024-10-23T17:30:50Z","title":"CLEAR: Character Unlearning in Textual and Visual Modalities","summary":" Machine Unlearning (MU) is critical for enhancing privacy and security in\ndeep learning models, particularly in large multimodal language models (MLLMs),\nby removing specific private or hazardous information. While MU has made\nsignificant progress in textual and visual modalities, multimodal unlearning\n(MMU) remains significantly underexplored, partially due to the absence of a\nsuitable open-source benchmark. To address this, we introduce CLEAR, a new\nbenchmark designed to evaluate MMU methods. CLEAR contains 200 fictitious\nindividuals and 3,700 images linked with corresponding question-answer pairs,\nenabling a thorough evaluation across modalities. We assess 10 MU methods,\nadapting them for MMU, and highlight new challenges specific to multimodal\nforgetting. The dataset is available at\nhttps://huggingface.co/datasets/therem/CLEAR\n","authors":["Alexey Dontsov","Dmitrii Korzh","Alexey Zhavoronkin","Boris Mikheev","Denis Bobkov","Aibek Alanov","Oleg Y. 
Rogov","Ivan Oseledets","Elena Tutubalina"],"pdf_url":"https://arxiv.org/pdf/2410.18057v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.15714v2","updated":"2024-12-23T10:45:32Z","published":"2024-12-20T09:37:02Z","title":"AutoLife: Automatic Life Journaling with Smartphones and LLMs","summary":" This paper introduces a novel mobile sensing application - life journaling -\ndesigned to generate semantic descriptions of users' daily lives. We present\nAutoLife, an automatic life journaling system based on commercial smartphones.\nAutoLife only inputs low-cost sensor data (without photos or audio) from\nsmartphones and can automatically generate comprehensive life journals for\nusers. To achieve this, we first derive time, motion, and location contexts\nfrom multimodal sensor data, and harness the zero-shot capabilities of Large\nLanguage Models (LLMs), enriched with commonsense knowledge about human lives,\nto interpret diverse contexts and generate life journals. To manage the task\ncomplexity and long sensing duration, a multilayer framework is proposed, which\ndecomposes tasks and seamlessly integrates LLMs with other techniques for life\njournaling. This study establishes a real-life dataset as a benchmark and\nextensive experiment results demonstrate that AutoLife produces accurate and\nreliable life journals.\n","authors":["Huatao Xu","Panrong Tong","Mo Li","Mani Srivastava"],"pdf_url":"https://arxiv.org/pdf/2412.15714v2.pdf","comment":"13 pages"}],"Computer Vision and Pattern Recognition":[{"id":"http://arxiv.org/abs/2407.07755v2","updated":"2024-12-23T23:56:06Z","published":"2024-07-10T15:28:02Z","title":"Neural Geometry Processing via Spherical Neural Surfaces","summary":" Neural surfaces (e.g., neural map encoding, deep implicits and neural\nradiance fields) have recently gained popularity because of their generic\nstructure (e.g., multi-layer perceptron) and easy integration with modern\nlearning-based setups. 
Traditionally, we have a rich toolbox of geometry\nprocessing algorithms designed for polygonal meshes to analyze and operate on\nsurface geometry. In the absence of an analogous toolbox, neural\nrepresentations are typically discretized and converted into a mesh, before\napplying any geometry processing algorithm. This is unsatisfactory and, as we\ndemonstrate, unnecessary. In this work, we propose a spherical neural surface\nrepresentation for genus-0 surfaces and demonstrate how to compute core\ngeometric operators directly on this representation. Namely, we estimate\nsurface normals and first and second fundamental forms of the surface, as well\nas compute surface gradient, surface divergence and Laplace-Beltrami operator\non scalar/vector fields defined on the surface. Our representation is fully\nseamless, overcoming a key limitation of similar explicit representations such\nas Neural Surface Maps [Morreale et al. 2021]. These operators, in turn, enable\ngeometry processing directly on the neural representations without any\nunnecessary meshing. We demonstrate illustrative applications in (neural)\nspectral analysis, heat flow and mean curvature flow, and evaluate robustness\nto isometric shape variations. We propose theoretical formulations and validate\ntheir numerical estimates, against analytical estimates, mesh-based baselines,\nand neural alternatives, where available. By systematically linking neural\nsurface representations with classical geometry processing algorithms, we\nbelieve that this work can become a key ingredient in enabling neural geometry\nprocessing. Code will be released upon acceptance, accessible from the project\nwebpage.\n","authors":["Romy Williamson","Niloy J. 
Mitra"],"pdf_url":"https://arxiv.org/pdf/2407.07755v2.pdf","comment":"14 pages, 14 figures"},{"id":"http://arxiv.org/abs/2408.14947v4","updated":"2024-12-23T23:33:41Z","published":"2024-08-27T10:44:34Z","title":"ERX: A Fast Real-Time Anomaly Detection Algorithm for Hyperspectral Line\n Scanning","summary":" Detecting unexpected objects (anomalies) in real time has great potential for\nmonitoring, managing, and protecting the environment. Hyperspectral line-scan\ncameras are a low-cost solution that enhance confidence in anomaly detection\nover RGB and multispectral imagery. However, existing line-scan algorithms are\ntoo slow when using small computers (e.g. those onboard a drone or small\nsatellite), do not adapt to changing scenery, or lack robustness against\ngeometric distortions. This paper introduces the Exponentially moving RX\nalgorithm (ERX) to address these issues, and compares it with four existing\nRX-based anomaly detection methods for hyperspectral line scanning. Three large\nand more complex datasets are also introduced to better assess the practical\nchallenges when using line-scan cameras (two hyperspectral and one\nmultispectral). ERX was evaluated using a Jetson Xavier NX edge computing\nmodule (6-core CPU, 8GB RAM, 20W power draw), achieving the best combination of\nspeed and detection performance. ERX was 9 times faster than the next-best\nalgorithm on the dataset with the highest number of bands (108 band), with an\naverage speed of 561 lines per second on the Jetson. It achieved a 29.3% AUC\nimprovement over the next-best algorithm on the most challenging dataset, while\nshowing greater adaptability through consistently high AUC scores regardless of\nthe camera's starting location. 
ERX performed robustly across all datasets,\nachieving an AUC of 0.941 on a drone-collected hyperspectral line scan dataset\nwithout geometric corrections (a 16.9% improvement over existing algorithms).\nThis work enables future research on the detection of anomalous objects in real\ntime, adaptive and automatic threshold selection, and real-time field tests.\nThe datasets and the Python code are openly available at:\nhttps://github.com/WiseGamgee/HyperAD, promoting accessibility and future work.\n","authors":["Samuel Garske","Bradley Evans","Christopher Artlett","KC Wong"],"pdf_url":"https://arxiv.org/pdf/2408.14947v4.pdf","comment":"17 pages, 13 figures, 4 tables, code and datasets accessible at\n https://github.com/WiseGamgee/HyperAD"},{"id":"http://arxiv.org/abs/2412.18038v1","updated":"2024-12-23T23:17:44Z","published":"2024-12-23T23:17:44Z","title":"AA-SGAN: Adversarially Augmented Social GAN with Synthetic Data","summary":" Accurately predicting pedestrian trajectories is crucial in applications such\nas autonomous driving or service robotics, to name a few. Deep generative\nmodels achieve top performance in this task, assuming enough labelled\ntrajectories are available for training. To this end, large amounts of\nsynthetically generated, labelled trajectories exist (e.g., generated by video\ngames). However, such trajectories are not meant to represent pedestrian motion\nrealistically and are ineffective at training a predictive model. We propose a\nmethod and an architecture to augment synthetic trajectories at training time\nand with an adversarial approach. 
We show that trajectory augmentation at\ntraining time unleashes significant gains when a state-of-the-art generative\nmodel is evaluated over real-world trajectories.\n","authors":["Mirko Zaffaroni","Federico Signoretta","Marco Grangetto","Attilio Fiandrotti"],"pdf_url":"https://arxiv.org/pdf/2412.18038v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.18027v1","updated":"2024-12-23T22:39:41Z","published":"2024-12-23T22:39:41Z","title":"LayerDropBack: A Universally Applicable Approach for Accelerating\n Training of Deep Networks","summary":" Training very deep convolutional networks is challenging, requiring\nsignificant computational resources and time. Existing acceleration methods\noften depend on specific architectures or require network modifications. We\nintroduce LayerDropBack (LDB), a simple yet effective method to accelerate\ntraining across a wide range of deep networks. LDB introduces randomness only\nin the backward pass, maintaining the integrity of the forward pass,\nguaranteeing that the same network is used during both training and inference.\nLDB can be seamlessly integrated into the training process of any model without\naltering its architecture, making it suitable for various network topologies.\nOur extensive experiments across multiple architectures (ViT, Swin Transformer,\nEfficientNet, DLA) and datasets (CIFAR-100, ImageNet) show significant training\ntime reductions of 16.93% to 23.97%, while preserving or even enhancing model\naccuracy. 
Code is available at https://github.com/neiterman21/LDB.\n","authors":["Evgeny Hershkovitch Neiterman","Gil Ben-Artzi"],"pdf_url":"https://arxiv.org/pdf/2412.18027v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.17991v1","updated":"2024-12-23T21:20:32Z","published":"2024-12-23T21:20:32Z","title":"Online Adaptation for Myographic Control of Natural Dexterous Hand and\n Finger Movements","summary":" One of the most elusive goals in myographic prosthesis control is the ability\nto reliably decode continuous positions simultaneously across multiple\ndegrees-of-freedom. Goal: To demonstrate dexterous, natural, biomimetic finger\nand wrist control of the highly advanced robotic Modular Prosthetic Limb.\nMethods: We combine sequential temporal regression models and reinforcement\nlearning using myographic signals to predict continuous simultaneous\npredictions of 7 finger and wrist degrees-of-freedom for 9 non-amputee human\nsubjects in a minimally-constrained freeform training process. Results: We\ndemonstrate highly dexterous 7 DoF position-based regression for prosthesis\ncontrol from EMG signals, with significantly lower error rates than traditional\napproaches (p < 0.001) and nearly zero prediction response time delay (p <\n0.001). Their performance can be continuously improved at any time using our\nfreeform reinforcement process. Significance: We have demonstrated the most\ndexterous, biomimetic, and natural prosthesis control performance ever obtained\nfrom the surface EMG signal. Our reinforcement approach allowed us to abandon\nstandard training protocols and simply allow the subject to move in any desired\nway while our models adapt. Conclusions: This work redefines the\nstate-of-the-art in myographic decoding in terms of the reliability,\nresponsiveness, and movement complexity available from prosthesis control\nsystems. 
The present-day emergence and convergence of advanced algorithmic\nmethods, experiment protocols, dexterous robotic prostheses, and sensor\nmodalities represents a unique opportunity to finally realize our ultimate goal\nof achieving fully restorative natural upper-limb function for amputees.\n","authors":["Joseph L. Betthauser","Rebecca Greene","Ananya Dhawan","John T. Krall","Christopher L. Hunt","Gyorgy Levay","Rahul R. Kaliki","Matthew S. Fifer","Siddhartha Sikdar","Nitish V. Thakor"],"pdf_url":"https://arxiv.org/pdf/2412.17991v1.pdf","comment":"Modified from Chapter 5 of J. L. Betthauser, \"Robust Adaptive\n Strategies for Myographic Prosthesis Movement Decoding,\" Doctoral\n Dissertation, Dept. of Electrical and Computer Engr, Johns Hopkins\n University, 2020"},{"id":"http://arxiv.org/abs/2412.17984v1","updated":"2024-12-23T21:06:08Z","published":"2024-12-23T21:06:08Z","title":"ICPR 2024 Competition on Domain Adaptation and GEneralization for\n Character Classification (DAGECC)","summary":" In this companion paper for the DAGECC (Domain Adaptation and GEneralization\nfor Character Classification) competition organized within the frame of the\nICPR 2024 conference, we present the general context of the tasks we proposed\nto the community, we introduce the data that were prepared for the competition\nand we provide a summary of the results along with a description of the top\nthree winning entries. 
The competition was centered around domain adaptation\nand generalization, and our core aim is to foster interest and facilitate\nadvancement on these topics by providing a high-quality, lightweight, real\nworld dataset able to support fast prototyping and validation of novel ideas.\n","authors":["Sofia Marino","Jennifer Vandoni","Emanuel Aldea","Ichraq Lemghari","Sylvie Le Hégarat-Mascle","Frédéric Jurie"],"pdf_url":"https://arxiv.org/pdf/2412.17984v1.pdf","comment":"Companion paper for the ICPR 2024 Competition on Domain Adaptation\n and GEneralization for Character Classification (DAGECC)"},{"id":"http://arxiv.org/abs/2412.17982v1","updated":"2024-12-23T21:01:32Z","published":"2024-12-23T21:01:32Z","title":"Unsupervised learning of spatially varying regularization for\n diffeomorphic image registration","summary":" Spatially varying regularization accommodates the deformation variations that\nmay be necessary for different anatomical regions during deformable image\nregistration. Historically, optimization-based registration models have\nharnessed spatially varying regularization to address anatomical subtleties.\nHowever, most modern deep learning-based models tend to gravitate towards\nspatially invariant regularization, wherein a homogenous regularization\nstrength is applied across the entire image, potentially disregarding localized\nvariations. In this paper, we propose a hierarchical probabilistic model that\nintegrates a prior distribution on the deformation regularization strength,\nenabling the end-to-end learning of a spatially varying deformation regularizer\ndirectly from the data. The proposed method is straightforward to implement and\neasily integrates with various registration network architectures.\nAdditionally, automatic tuning of hyperparameters is achieved through Bayesian\noptimization, allowing efficient identification of optimal hyperparameters for\nany given registration task. 
Comprehensive evaluations on publicly available\ndatasets demonstrate that the proposed method significantly improves\nregistration performance and enhances the interpretability of deep\nlearning-based registration, all while maintaining smooth deformations.\n","authors":["Junyu Chen","Shuwen Wei","Yihao Liu","Zhangxing Bian","Yufan He","Aaron Carass","Harrison Bai","Yong Du"],"pdf_url":"https://arxiv.org/pdf/2412.17982v1.pdf","comment":"Code available at http://bit.ly/3BrXGxz"},{"id":"http://arxiv.org/abs/2412.16050v2","updated":"2024-12-23T20:51:26Z","published":"2024-12-20T16:52:11Z","title":"Label-Efficient Data Augmentation with Video Diffusion Models for\n Guidewire Segmentation in Cardiac Fluoroscopy","summary":" The accurate segmentation of guidewires in interventional cardiac fluoroscopy\nvideos is crucial for computer-aided navigation tasks. Although deep learning\nmethods have demonstrated high accuracy and robustness in wire segmentation,\nthey require substantial annotated datasets for generalizability, underscoring\nthe need for extensive labeled data to enhance model performance. To address\nthis challenge, we propose the Segmentation-guided Frame-consistency Video\nDiffusion Model (SF-VD) to generate large collections of labeled fluoroscopy\nvideos, augmenting the training data for wire segmentation networks. SF-VD\nleverages videos with limited annotations by independently modeling scene\ndistribution and motion distribution. It first samples the scene distribution\nby generating 2D fluoroscopy images with wires positioned according to a\nspecified input mask, and then samples the motion distribution by progressively\ngenerating subsequent frames, ensuring frame-to-frame coherence through a\nframe-consistency strategy. A segmentation-guided mechanism further refines the\nprocess by adjusting wire contrast, ensuring a diverse range of visibility in\nthe synthesized image. 
Evaluation on a fluoroscopy dataset confirms the\nsuperior quality of the generated videos and shows significant improvements in\nguidewire segmentation.\n","authors":["Shaoyan Pan","Yikang Liu","Lin Zhao","Eric Z. Chen","Xiao Chen","Terrence Chen","Shanhui Sun"],"pdf_url":"https://arxiv.org/pdf/2412.16050v2.pdf","comment":"AAAI 2025"},{"id":"http://arxiv.org/abs/2412.17975v1","updated":"2024-12-23T20:42:15Z","published":"2024-12-23T20:42:15Z","title":"Improving Sickle Cell Disease Classification: A Fusion of Conventional\n Classifiers, Segmented Images, and Convolutional Neural Networks","summary":" Sickle cell anemia, which is characterized by abnormal erythrocyte\nmorphology, can be detected using microscopic images. Computational techniques\nin medicine enhance the diagnosis and treatment efficiency. However, many\ncomputational techniques, particularly those based on Convolutional Neural\nNetworks (CNNs), require high resources and time for training, highlighting the\nresearch opportunities in methods with low computational overhead. In this\npaper, we propose a novel approach combining conventional classifiers,\nsegmented images, and CNNs for the automated classification of sickle cell\ndisease. We evaluated the impact of segmented images on classification,\nproviding insight into deep learning integration. Our results demonstrate that\nusing segmented images and CNN features with an SVM achieves an accuracy of\n96.80%. 
This finding is relevant for computationally efficient scenarios,\npaving the way for future research and advancements in medical-image analysis.\n","authors":["Victor Júnio Alcântara Cardoso","Rodrigo Moreira","João Fernando Mari","Larissa Ferreira Rodrigues Moreira"],"pdf_url":"https://arxiv.org/pdf/2412.17975v1.pdf","comment":"14 pages"},{"id":"http://arxiv.org/abs/2412.17968v1","updated":"2024-12-23T20:33:34Z","published":"2024-12-23T20:33:34Z","title":"A Multimodal Fusion Framework for Bridge Defect Detection with\n Cross-Verification","summary":" This paper presents a pilot study introducing a multimodal fusion framework\nfor the detection and analysis of bridge defects, integrating Non-Destructive\nEvaluation (NDE) techniques with advanced image processing to enable precise\nstructural assessment. By combining data from Impact Echo (IE) and Ultrasonic\nSurface Waves (USW) methods, this preliminary investigation focuses on\nidentifying defect-prone regions within concrete structures, emphasizing\ncritical indicators such as delamination and debonding. Using geospatial\nanalysis with alpha shapes, fusion of defect points, and unified lane\nboundaries, the proposed framework consolidates disparate data sources to\nenhance defect localization and facilitate the identification of overlapping\ndefect regions. Cross-verification with adaptive image processing further\nvalidates detected defects by aligning their coordinates with visual data,\nutilizing advanced contour-based mapping and bounding box techniques for\nprecise defect identification. The experimental results, with an F1 score of\n0.83, demonstrate the potential efficacy of the approach in improving defect\nlocalization, reducing false positives, and enhancing detection accuracy, which\nprovides a foundation for future research and larger-scale validation. 
This\npreliminary exploration establishes the framework as a promising tool for\nefficient bridge health assessment, with implications for proactive structural\nmonitoring and maintenance.\n","authors":["Ravi Datta Rachuri","Duoduo Liao","Samhita Sarikonda","Datha Vaishnavi Kondur"],"pdf_url":"https://arxiv.org/pdf/2412.17968v1.pdf","comment":"Accepted by IEEE Big Data 2024"}],"Information Retrieval":[{"id":"http://arxiv.org/abs/2412.18042v1","updated":"2024-12-23T23:29:29Z","published":"2024-12-23T23:29:29Z","title":"Time-Probability Dependent Knowledge Extraction in IoT-enabled Smart\n Building","summary":" Smart buildings incorporate various emerging Internet of Things (IoT)\napplications for comprehensive management of energy efficiency, human comfort,\nautomation, and security. However, the development of a knowledge extraction\nframework is fundamental. Currently, there is a lack of a unified and practical\nframework for modeling heterogeneous sensor data within buildings. In this\npaper, we propose a practical inference framework for extracting\nstatus-to-event knowledge within smart building. Our proposal includes\nIoT-based API integration, ontology model design, and time probability\ndependent knowledge extraction methods. The Building Topology Ontology (BOT)\nwas leveraged to construct spatial relations among sensors and spaces within\nthe building. We utilized Apache Jena Fuseki's SPARQL server for storing and\nquerying the RDF triple data. Two types of knowledge could be extracted:\ntimestamp-based probability for abnormal event detection and time\ninterval-based probability for conjunction of multiple events. We conducted\nexperiments (over a 78-day period) in a real smart building environment. The\ndata of light and elevator states has been collected for evaluation. The\nevaluation revealed several inferred events, such as room occupancy, elevator\ntrajectory tracking, and the conjunction of both events. 
The numerical values\nof detected event counts and probability demonstrate the potential for\nautomatic control in the smart building.\n","authors":["Hangli Ge","Hirotsugu Seike","Noboru Koshizuka"],"pdf_url":"https://arxiv.org/pdf/2412.18042v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.17998v1","updated":"2024-12-23T21:42:31Z","published":"2024-12-23T21:42:31Z","title":"WavePulse: Real-time Content Analytics of Radio Livestreams","summary":" Radio remains a pervasive medium for mass information dissemination, with\nAM/FM stations reaching more Americans than either smartphone-based social\nnetworking or live television. Increasingly, radio broadcasts are also streamed\nonline and accessed over the Internet. We present WavePulse, a framework that\nrecords, documents, and analyzes radio content in real-time. While our\nframework is generally applicable, we showcase the efficacy of WavePulse in a\ncollaborative project with a team of political scientists focusing on the 2024\nPresidential Elections. We use WavePulse to monitor livestreams of 396 news\nradio stations over a period of three months, processing close to 500,000 hours\nof audio streams. These streams were converted into time-stamped, diarized\ntranscripts and analyzed to track answer key political science questions at\nboth the national and state levels. Our analysis revealed how local issues\ninteracted with national trends, providing insights into information flow. Our\nresults demonstrate WavePulse's efficacy in capturing and analyzing content\nfrom radio livestreams sourced from the Web. Code and dataset can be accessed\nat \\url{https://wave-pulse.io}.\n","authors":["Govind Mittal","Sarthak Gupta","Shruti Wagle","Chirag Chopra","Anthony J DeMattee","Nasir Memon","Mustaque Ahamad","Chinmay Hegde"],"pdf_url":"https://arxiv.org/pdf/2412.17998v1.pdf","comment":"22 Pages: 10 main + 12 appendix, 24 figures. 
Access code and dataset\n  at https://wave-pulse.io"},{"id":"http://arxiv.org/abs/2412.15241v2","updated":"2024-12-23T17:59:23Z","published":"2024-12-13T09:52:25Z","title":"Quantifying Positional Biases in Text Embedding Models","summary":"  Embedding models are crucial for tasks in Information Retrieval (IR) and\nsemantic similarity measurement, yet their handling of longer texts and\nassociated positional biases remains underexplored. In this study, we\ninvestigate the impact of content position and input size on text embeddings.\nOur experiments reveal that embedding models, irrespective of their positional\nencoding mechanisms, disproportionately prioritize the beginning of an input.\nAblation studies demonstrate that insertion of irrelevant text or removal at\nthe start of a document reduces cosine similarity between altered and original\nembeddings by up to 12.3\\% more than ablations at the end. Regression analysis\nfurther confirms this bias, with sentence importance declining as position\nmoves further from the start, even with content-agnosticity. We\nhypothesize that this effect arises from pre-processing strategies and chosen\npositional encoding techniques. These findings quantify the sensitivity of\nretrieval systems and suggest a new lens towards embedding model robustness.\n","authors":["Reagan J. Lee","Samarth Goel","Kannan Ramchandran"],"pdf_url":"https://arxiv.org/pdf/2412.15241v2.pdf","comment":"13 pages, 11 figures, NeurIPS"},{"id":"http://arxiv.org/abs/2412.10571v3","updated":"2024-12-23T16:12:59Z","published":"2024-12-13T21:28:17Z","title":"Evidence Contextualization and Counterfactual Attribution for\n  Conversational QA over Heterogeneous Data with RAG Systems","summary":"  Retrieval Augmented Generation (RAG) works as a backbone for interacting with\nan enterprise's own data via Conversational Question Answering (ConvQA). 
In a\nRAG system, a retriever fetches passages from a collection in response to a\nquestion, which are then included in the prompt of a large language model (LLM)\nfor generating a natural language (NL) answer. However, several RAG systems\ntoday suffer from two shortcomings: (i) retrieved passages usually contain\ntheir raw text and lack appropriate document context, negatively impacting both\nretrieval and answering quality; and (ii) attribution strategies that explain\nanswer generation typically rely only on similarity between the answer and the\nretrieved passages, thereby only generating plausible but not causal\nexplanations. In this work, we demonstrate RAGONITE, a RAG system that remedies\nthe above concerns by: (i) contextualizing evidence with source metadata and\nsurrounding text; and (ii) computing counterfactual attribution, a causal\nexplanation approach where the contribution of an evidence to an answer is\ndetermined by the similarity of the original response to the answer obtained by\nremoving that evidence. To evaluate our proposals, we release a new benchmark\nConfQuestions: it has 300 hand-created conversational questions, each in\nEnglish and German, coupled with ground truth URLs, completed questions, and\nanswers from 215 public Confluence pages. These documents are typical of\nenterprise wiki spaces with heterogeneous elements. 
Experiments with RAGONITE\non ConfQuestions show the viability of our ideas: contextualization improves\nRAG performance, and counterfactual explanations outperform standard\nattribution.\n","authors":["Rishiraj Saha Roy","Joel Schlotthauer","Chris Hinze","Andreas Foltyn","Luzian Hahn","Fabian Kuech"],"pdf_url":"https://arxiv.org/pdf/2412.10571v3.pdf","comment":"Accepted at WSDM 2025, 8 pages"},{"id":"http://arxiv.org/abs/2412.17593v1","updated":"2024-12-23T14:10:09Z","published":"2024-12-23T14:10:09Z","title":"Leveraging Memory Retrieval to Enhance LLM-based Generative\n Recommendation","summary":" Leveraging Large Language Models (LLMs) to harness user-item interaction\nhistories for item generation has emerged as a promising paradigm in generative\nrecommendation. However, the limited context window of LLMs often restricts\nthem to focusing on recent user interactions only, leading to the neglect of\nlong-term interests involved in the longer histories. To address this\nchallenge, we propose a novel Automatic Memory-Retrieval framework (AutoMR),\nwhich is capable of storing long-term interests in the memory and extracting\nrelevant information from it for next-item generation within LLMs. 
Extensive\nexperimental results on two real-world datasets demonstrate the effectiveness\nof our proposed AutoMR framework in utilizing long-term interests for\ngenerative recommendation.\n","authors":["Chengbing Wang","Yang Zhang","Fengbin Zhu","Jizhi Zhang","Tianhao Shi","Fuli Feng"],"pdf_url":"https://arxiv.org/pdf/2412.17593v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2406.02048v2","updated":"2024-12-23T13:26:56Z","published":"2024-06-04T07:29:59Z","title":"Your Causal Self-Attentive Recommender Hosts a Lonely Neighborhood","summary":" In the context of sequential recommendation, a pivotal issue pertains to the\ncomparative analysis between bi-directional/auto-encoding (AE) and\nuni-directional/auto-regressive (AR) attention mechanisms, where the\nconclusions regarding architectural and performance superiority remain\ninconclusive. Previous efforts in such comparisons primarily involve\nsummarizing existing works to identify a consensus or conducting ablation\nstudies on peripheral modeling techniques, such as choices of loss functions.\nHowever, far fewer efforts have been made in (1) theoretical and (2) extensive\nempirical analysis of the self-attention module, the very pivotal structure on\nwhich performance and designing insights should be anchored. In this work, we\nfirst provide a comprehensive theoretical analysis of AE/AR attention matrix in\nthe aspect of (1) sparse local inductive bias, a.k.a neighborhood effects, and\n(2) low rank approximation. Analytical metrics reveal that the AR attention\nexhibits sparse neighborhood effects suitable for generally sparse\nrecommendation scenarios. Secondly, to support our theoretical analysis, we\nconduct extensive empirical experiments on comparing vanilla and variant AE/AR\nattention on five popular benchmarks with AR performing better overall. 
Results\nbased on adaptive tuning, modularized design and Huggingface are reported.\nLastly, we shed light on future design choices for performant self-attentive\nrecommenders. We make our code and data available at\nhttps://github.com/yueqirex/Self-Attention-Direction-Check.\n","authors":["Yueqi Wang","Zhankui He","Zhenrui Yue","Julian McAuley","Dong Wang"],"pdf_url":"https://arxiv.org/pdf/2406.02048v2.pdf","comment":"Accepted to WSDM'25. Updates from the previous version: Added\n theoretical attention matrix analysis"},{"id":"http://arxiv.org/abs/2412.17552v1","updated":"2024-12-23T13:20:06Z","published":"2024-12-23T13:20:06Z","title":"Comparative Analysis of Document-Level Embedding Methods for Similarity\n Scoring on Shakespeare Sonnets and Taylor Swift Lyrics","summary":" This study evaluates the performance of TF-IDF weighting, averaged Word2Vec\nembeddings, and BERT embeddings for document similarity scoring across two\ncontrasting textual domains. By analysing cosine similarity scores, the\nmethods' strengths and limitations are highlighted. The findings underscore\nTF-IDF's reliance on lexical overlap and Word2Vec's superior semantic\ngeneralisation, particularly in cross-domain comparisons. BERT demonstrates\nlower performance in challenging domains, likely due to insufficient\ndomainspecific fine-tuning.\n","authors":["Klara Kramer"],"pdf_url":"https://arxiv.org/pdf/2412.17552v1.pdf","comment":"9 pages, 4 figures"},{"id":"http://arxiv.org/abs/2412.17534v1","updated":"2024-12-23T12:58:30Z","published":"2024-12-23T12:58:30Z","title":"CiteBART: Learning to Generate Citations for Local Citation\n Recommendation","summary":" Citations are essential building blocks in scientific writing. The scientific\ncommunity is longing for support in their generation. Citation generation\ninvolves two complementary subtasks: Determining the citation worthiness of a\ncontext and, if it's worth it, proposing the best candidate papers for the\ncitation placeholder. 
The latter subtask is called local citation\nrecommendation (LCR). This paper proposes CiteBART, a custom BART pre-training\nbased on citation token masking to generate citations to achieve LCR. In the\nbase scheme, we mask the citation token in the local citation context to make\nthe citation prediction. In the global one, we concatenate the citing paper's\ntitle and abstract to the local citation context to learn to reconstruct the\ncitation token. CiteBART outperforms state-of-the-art approaches on the\ncitation recommendation benchmarks except for the smallest FullTextPeerRead\ndataset. The effect is significant in the larger benchmarks, e.g., Refseer and\nArXiv. We present a qualitative analysis and an ablation study to provide\ninsights into the workings of CiteBART. Our analyses confirm that its\ngenerative nature brings about a zero-shot capability.\n","authors":["Ege Yiğit Çelik","Selma Tekir"],"pdf_url":"https://arxiv.org/pdf/2412.17534v1.pdf","comment":"15 pages, 2 figures, 7 tables"},{"id":"http://arxiv.org/abs/2406.12052v2","updated":"2024-12-23T08:30:47Z","published":"2024-06-17T19:45:21Z","title":"UniGLM: Training One Unified Language Model for Text-Attributed Graph\n Embedding","summary":" Representation learning on text-attributed graphs (TAGs), where nodes are\nrepresented by textual descriptions, is crucial for textual and relational\nknowledge systems and recommendation systems. Currently, state-of-the-art\nembedding methods for TAGs primarily focus on fine-tuning language models\n(e.g., BERT) using structure-aware training signals. While effective, these\nmethods are tailored for individual TAG and cannot generalize across various\ngraph scenarios. Given the shared textual space, leveraging multiple TAGs for\njoint fine-tuning, aligning text and graph structure from different aspects,\nwould be more beneficial. 
Motivated by this, we introduce a novel Unified Graph\nLanguage Model (UniGLM) framework, the first graph embedding model that\ngeneralizes well to both in-domain and cross-domain TAGs. Specifically, UniGLM\nis trained over multiple TAGs with different domains and scales using\nself-supervised contrastive learning. UniGLM includes an adaptive positive\nsample selection technique for identifying structurally similar nodes and a\nlazy contrastive module that is devised to accelerate training by minimizing\nrepetitive encoding calculations. Extensive empirical results across 9\nbenchmark TAGs demonstrate UniGLM's efficacy against leading embedding\nbaselines in terms of generalization (various downstream tasks and backbones)\nand transfer learning (in and out of domain scenarios). The code is available\nat https://github.com/NYUSHCS/UniGLM.\n","authors":["Yi Fang","Dongzhe Fan","Sirui Ding","Ninghao Liu","Qiaoyu Tan"],"pdf_url":"https://arxiv.org/pdf/2406.12052v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.17374v1","updated":"2024-12-23T08:15:34Z","published":"2024-12-23T08:15:34Z","title":"Scenario-Wise Rec: A Multi-Scenario Recommendation Benchmark","summary":" Multi Scenario Recommendation (MSR) tasks, referring to building a unified\nmodel to enhance performance across all recommendation scenarios, have recently\ngained much attention. However, current research in MSR faces two significant\nchallenges that hinder the field's development: the absence of uniform\nprocedures for multi-scenario dataset processing, thus hindering fair\ncomparisons, and most models being closed-sourced, which complicates\ncomparisons with current SOTA models. Consequently, we introduce our benchmark,\n\\textbf{Scenario-Wise Rec}, which comprises 6 public datasets and 12 benchmark\nmodels, along with a training and evaluation pipeline. 
Additionally, we\nvalidated the benchmark using an industrial advertising dataset, reinforcing\nits reliability and applicability in real-world scenarios. We aim for this\nbenchmark to offer researchers valuable insights from prior work, enabling the\ndevelopment of novel models based on our benchmark and thereby fostering a\ncollaborative research ecosystem in MSR. Our source code is also publicly\navailable.\n","authors":["Xiaopeng Li","Jingtong Gao","Pengyue Jia","Yichao Wang","Wanyu Wang","Yejing Wang","Yuhao Wang","Huifeng Guo","Ruiming Tang"],"pdf_url":"https://arxiv.org/pdf/2412.17374v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.17364v1","updated":"2024-12-23T07:55:22Z","published":"2024-12-23T07:55:22Z","title":"Efficient fine-tuning methodology of text embedding models for\n information retrieval: contrastive learning penalty (clp)","summary":" Text embedding models play a crucial role in natural language processing,\nparticularly in information retrieval, and their importance is further\nhighlighted with the recent utilization of RAG (Retrieval- Augmented\nGeneration). This study presents an efficient fine-tuning methodology\nencompassing data selection, loss function, and model architecture to enhance\nthe information retrieval performance of pre-trained text embedding models. In\nparticular, this study proposes a novel Contrastive Learning Penalty function\nthat overcomes the limitations of existing Contrastive Learning. The proposed\nmethodology achieves significant performance improvements over existing methods\nin document retrieval tasks. This study is expected to contribute to improving\nthe performance of information retrieval systems through fine-tuning of text\nembedding models. 
The code for this study can be found at\nhttps://github.com/CreaLabs/Enhanced-BGE-M3-with-CLP-and-MoE, and the\nbest-performing model can be found at https://huggingface.co/CreaLabs.\n","authors":["Jeongsu Yu"],"pdf_url":"https://arxiv.org/pdf/2412.17364v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.17310v1","updated":"2024-12-23T06:04:14Z","published":"2024-12-23T06:04:14Z","title":"Popularity Estimation and New Bundle Generation using Content and\n Context based Embeddings","summary":" Recommender systems create enormous value for businesses and their consumers.\nThey increase revenue for businesses while improving the consumer experience by\nrecommending relevant products amidst huge product base. Product bundling is an\nexciting development in the field of product recommendations. It aims at\ngenerating new bundles and recommending exciting and relevant bundles to their\nconsumers. Unlike traditional recommender systems that recommend single items\nto consumers, product bundling aims at targeting a bundle, or a set of items,\nto the consumers. While bundle recommendation has attracted significant\nresearch interest recently, extant literature on bundle generation is scarce.\nMoreover, metrics to identify if a bundle is popular or not is not well\nstudied. In this work, we aim to fulfill this gap by introducing new bundle\npopularity metrics based on sales, consumer experience and item diversity in a\nbundle. We use these metrics in the methodology proposed in this paper to\ngenerate new bundles for mobile games using content aware and context aware\nembeddings. We use opensource Steam Games dataset for our analysis. Our\nexperiments indicate that we can generate new bundles that can outperform the\nexisting bundles on the popularity metrics by 32% - 44%. Our experiments are\ncomputationally efficient and the proposed methodology is generic that can be\nextended to other bundling problems e.g. 
product bundling, music bundling.\n","authors":["Ashutosh Nayak","Prajwal NJ","Sameeksha Keshav","Kavitha S. N.","Roja Reddy","Rajasekhara Reddy Duvvuru Muni"],"pdf_url":"https://arxiv.org/pdf/2412.17310v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.17259v1","updated":"2024-12-23T04:02:46Z","published":"2024-12-23T04:02:46Z","title":"LegalAgentBench: Evaluating LLM Agents in Legal Domain","summary":" With the increasing intelligence and autonomy of LLM agents, their potential\napplications in the legal domain are becoming increasingly apparent. However,\nexisting general-domain benchmarks cannot fully capture the complexity and\nsubtle nuances of real-world judicial cognition and decision-making. Therefore,\nwe propose LegalAgentBench, a comprehensive benchmark specifically designed to\nevaluate LLM Agents in the Chinese legal domain. LegalAgentBench includes 17\ncorpora from real-world legal scenarios and provides 37 tools for interacting\nwith external knowledge. We designed a scalable task construction framework and\ncarefully annotated 300 tasks. These tasks span various types, including\nmulti-hop reasoning and writing, and range across different difficulty levels,\neffectively reflecting the complexity of real-world legal scenarios. Moreover,\nbeyond evaluating final success, LegalAgentBench incorporates keyword analysis\nduring intermediate processes to calculate progress rates, enabling more\nfine-grained evaluation. We evaluated eight popular LLMs, highlighting the\nstrengths, limitations, and potential areas for improvement of existing models\nand methods. 
LegalAgentBench sets a new benchmark for the practical application\nof LLMs in the legal domain, with its code and data available at\n\\url{https://github.com/CSHaitao/LegalAgentBench}.\n","authors":["Haitao Li","Junjie Chen","Jingli Yang","Qingyao Ai","Wei Jia","Youfeng Liu","Kai Lin","Yueyue Wu","Guozhi Yuan","Yiran Hu","Wuyue Wang","Yiqun Liu","Minlie Huang"],"pdf_url":"https://arxiv.org/pdf/2412.17259v1.pdf","comment":"23 pages"},{"id":"http://arxiv.org/abs/2412.15005v2","updated":"2024-12-23T03:49:58Z","published":"2024-12-19T16:20:42Z","title":"DisCo: Graph-Based Disentangled Contrastive Learning for Cold-Start\n Cross-Domain Recommendation","summary":" Recommender systems are widely used in various real-world applications, but\nthey often encounter the persistent challenge of the user cold-start problem.\nCross-domain recommendation (CDR), which leverages user interactions from one\ndomain to improve prediction performance in another, has emerged as a promising\nsolution. However, users with similar preferences in the source domain may\nexhibit different interests in the target domain. Therefore, directly\ntransferring embeddings may introduce irrelevant source-domain collaborative\ninformation. In this paper, we propose a novel graph-based disentangled\ncontrastive learning framework to capture fine-grained user intent and filter\nout irrelevant collaborative information, thereby avoiding negative transfer.\nSpecifically, for each domain, we use a multi-channel graph encoder to capture\ndiverse user intents. We then construct the affinity graph in the embedding\nspace and perform multi-step random walks to capture high-order user similarity\nrelationships. Treating one domain as the target, we propose a disentangled\nintent-wise contrastive learning approach, guided by user similarity, to refine\nthe bridging of user intents across domains. 
Extensive experiments on four\nbenchmark CDR datasets demonstrate that DisCo consistently outperforms existing\nstate-of-the-art baselines, thereby validating the effectiveness of both DisCo\nand its components.\n","authors":["Hourun Li","Yifan Wang","Zhiping Xiao","Jia Yang","Changling Zhou","Ming Zhang","Wei Ju"],"pdf_url":"https://arxiv.org/pdf/2412.15005v2.pdf","comment":"Accepted at AAAI 2025"},{"id":"http://arxiv.org/abs/2412.17250v1","updated":"2024-12-23T03:49:00Z","published":"2024-12-23T03:49:00Z","title":"SyNeg: LLM-Driven Synthetic Hard-Negatives for Dense Retrieval","summary":" The performance of Dense retrieval (DR) is significantly influenced by the\nquality of negative sampling. Traditional DR methods primarily depend on naive\nnegative sampling techniques or on mining hard negatives through external\nretriever and meticulously crafted strategies. However, naive negative sampling\noften fails to adequately capture the accurate boundaries between positive and\nnegative samples, whereas existing hard negative sampling methods are prone to\nfalse negatives, resulting in performance degradation and training instability.\nRecent advancements in large language models (LLMs) offer an innovative\nsolution to these challenges by generating contextually rich and diverse\nnegative samples. In this work, we present a framework that harnesses LLMs to\nsynthesize high-quality hard negative samples. We first devise a\n\\textit{multi-attribute self-reflection prompting strategy} to direct LLMs in\nhard negative sample generation. Then, we implement a \\textit{hybrid sampling\nstrategy} that integrates these synthetic negatives with traditionally\nretrieved negatives, thereby stabilizing the training process and improving\nretrieval performance. 
Extensive experiments on five benchmark datasets\ndemonstrate the efficacy of our approach, and code is also publicly available.\n","authors":["Xiaopeng Li","Xiangyang Li","Hao Zhang","Zhaocheng Du","Pengyue Jia","Yichao Wang","Xiangyu Zhao","Huifeng Guo","Ruiming Tang"],"pdf_url":"https://arxiv.org/pdf/2412.17250v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.17245v1","updated":"2024-12-23T03:37:58Z","published":"2024-12-23T03:37:58Z","title":"GraphHash: Graph Clustering Enables Parameter Efficiency in Recommender\n Systems","summary":" Deep recommender systems rely heavily on large embedding tables to handle\nhigh-cardinality categorical features such as user/item identifiers, and face\nsignificant memory constraints at scale. To tackle this challenge, hashing\ntechniques are often employed to map multiple entities to the same embedding\nand thus reduce the size of the embedding tables. Concurrently, graph-based\ncollaborative signals have emerged as powerful tools in recommender systems,\nyet their potential for optimizing embedding table reduction remains\nunexplored. This paper introduces GraphHash, the first graph-based approach\nthat leverages modularity-based bipartite graph clustering on user-item\ninteraction graphs to reduce embedding table sizes. We demonstrate that the\nmodularity objective has a theoretical connection to message-passing, which\nprovides a foundation for our method. By employing fast clustering algorithms,\nGraphHash serves as a computationally efficient proxy for message-passing\nduring preprocessing and a plug-and-play graph-based alternative to traditional\nID hashing. Extensive experiments show that GraphHash substantially outperforms\ndiverse hashing baselines on both retrieval and click-through-rate prediction\ntasks. 
In particular, GraphHash achieves on average a 101.52% improvement in\nrecall when reducing the embedding table size by more than 75%, highlighting\nthe value of graph-based collaborative information for model reduction.\n","authors":["Xinyi Wu","Donald Loveland","Runjin Chen","Yozen Liu","Xin Chen","Leonardo Neves","Ali Jadbabaie","Clark Mingxuan Ju","Neil Shah","Tong Zhao"],"pdf_url":"https://arxiv.org/pdf/2412.17245v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.17239v1","updated":"2024-12-23T03:19:19Z","published":"2024-12-23T03:19:19Z","title":"Unity is Strength: Unifying Convolutional and Transformeral Features for\n Better Person Re-Identification","summary":" Person Re-identification (ReID) aims to retrieve the specific person across\nnon-overlapping cameras, which greatly helps intelligent transportation\nsystems. As we all know, Convolutional Neural Networks (CNNs) and Transformers\nhave the unique strengths to extract local and global features, respectively.\nConsidering this fact, we focus on the mutual fusion between them to learn more\ncomprehensive representations for persons. In particular, we utilize the\ncomplementary integration of deep features from different model structures. We\npropose a novel fusion framework called FusionReID to unify the strengths of\nCNNs and Transformers for image-based person ReID. More specifically, we first\ndeploy a Dual-branch Feature Extraction (DFE) to extract features through CNNs\nand Transformers from a single image. Moreover, we design a novel\nDual-attention Mutual Fusion (DMF) to achieve sufficient feature fusions. The\nDMF comprises Local Refinement Units (LRU) and Heterogenous Transmission\nModules (HTM). LRU utilizes depth-separable convolutions to align deep features\nin channel dimensions and spatial sizes. HTM consists of a Shared Encoding Unit\n(SEU) and two Mutual Fusion Units (MFU). 
Through the continuous stacking of\nHTM, deep features after LRU are repeatedly utilized to generate more\ndiscriminative features. Extensive experiments on three public ReID benchmarks\ndemonstrate that our method can attain superior performances than most\nstate-of-the-arts. The source code is available at\nhttps://github.com/924973292/FusionReID.\n","authors":["Yuhao Wang","Pingping Zhang","Xuehu Liu","Zhengzheng Tu","Huchuan Lu"],"pdf_url":"https://arxiv.org/pdf/2412.17239v1.pdf","comment":"Accepted by Trans. on ITS"}],"Multimedia":[{"id":"http://arxiv.org/abs/2409.07759v2","updated":"2024-12-23T20:03:22Z","published":"2024-09-12T05:33:15Z","title":"SwinGS: Sliding Window Gaussian Splatting for Volumetric Video Streaming\n with Arbitrary Length","summary":" Recent advances in 3D Gaussian Splatting (3DGS) have garnered significant\nattention in computer vision and computer graphics due to its high rendering\nspeed and remarkable quality. While extant research has endeavored to extend\nthe application of 3DGS from static to dynamic scenes, such efforts have been\nconsistently impeded by excessive model sizes, constraints on video duration,\nand content deviation. These limitations significantly compromise the\nstreamability of dynamic 3D Gaussian models, thereby restricting their utility\nin downstream applications, including volumetric video, autonomous vehicle, and\nimmersive technologies such as virtual, augmented, and mixed reality.\n This paper introduces SwinGS, a novel framework for training, delivering, and\nrendering volumetric video in a real-time streaming fashion. To address the\naforementioned challenges and enhance streamability, SwinGS integrates\nspacetime Gaussian with Markov Chain Monte Carlo (MCMC) to adapt the model to\nfit various 3D scenes across frames, in the meantime employing a sliding window\ncaptures Gaussian snapshots for each frame in an accumulative way. 
We implement\na prototype of SwinGS and demonstrate its streamability across various datasets\nand scenes. Additionally, we develop an interactive WebGL viewer enabling\nreal-time volumetric video playback on most devices with modern browsers,\nincluding smartphones and tablets. Experimental results show that SwinGS\nreduces transmission costs by 83.6% compared to previous work with ignorable\ncompromise in PSNR. Moreover, SwinGS easily scales to long video sequences\nwithout compromising quality.\n","authors":["Bangya Liu","Suman Banerjee"],"pdf_url":"https://arxiv.org/pdf/2409.07759v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.17907v1","updated":"2024-12-23T19:00:34Z","published":"2024-12-23T19:00:34Z","title":"A Multimodal Emotion Recognition System: Integrating Facial Expressions,\n Body Movement, Speech, and Spoken Language","summary":" Traditional psychological evaluations rely heavily on human observation and\ninterpretation, which are prone to subjectivity, bias, fatigue, and\ninconsistency. To address these limitations, this work presents a multimodal\nemotion recognition system that provides a standardised, objective, and\ndata-driven tool to support evaluators, such as psychologists, psychiatrists,\nand clinicians. The system integrates recognition of facial expressions,\nspeech, spoken language, and body movement analysis to capture subtle emotional\ncues that are often overlooked in human evaluations. By combining these\nmodalities, the system provides more robust and comprehensive emotional state\nassessment, reducing the risk of mis- and overdiagnosis. Preliminary testing in\na simulated real-world condition demonstrates the system's potential to provide\nreliable emotional insights to improve the diagnostic accuracy. 
This work\nhighlights the promise of automated multimodal analysis as a valuable\ncomplement to traditional psychological evaluation practices, with applications\nin clinical and therapeutic settings.\n","authors":["Kris Kraack"],"pdf_url":"https://arxiv.org/pdf/2412.17907v1.pdf","comment":"10 pages, 6 figures, 3 tables"},{"id":"http://arxiv.org/abs/2412.17667v1","updated":"2024-12-23T15:53:21Z","published":"2024-12-23T15:53:21Z","title":"VERSA: A Versatile Evaluation Toolkit for Speech, Audio, and Music","summary":" In this work, we introduce VERSA, a unified and standardized evaluation\ntoolkit designed for various speech, audio, and music signals. The toolkit\nfeatures a Pythonic interface with flexible configuration and dependency\ncontrol, making it user-friendly and efficient. With full installation, VERSA\noffers 63 metrics with 711 metric variations based on different configurations.\nThese metrics encompass evaluations utilizing diverse external resources,\nincluding matching and non-matching reference audio, text transcriptions, and\ntext captions. As a lightweight yet comprehensive toolkit, VERSA is versatile\nto support the evaluation of a wide range of downstream scenarios. To\ndemonstrate its capabilities, this work highlights example use cases for VERSA,\nincluding audio coding, speech synthesis, speech enhancement, singing\nsynthesis, and music generation. The toolkit is available at\nhttps://github.com/shinjiwlab/versa.\n","authors":["Jiatong Shi","Hye-jin Shim","Jinchuan Tian","Siddhant Arora","Haibin Wu","Darius Petermann","Jia Qi Yip","You Zhang","Yuxun Tang","Wangyou Zhang","Dareen Safar Alharthi","Yichen Huang","Koichi Saito","Jionghao Han","Yiwen Zhao","Chris Donahue","Shinji Watanabe"],"pdf_url":"https://arxiv.org/pdf/2412.17667v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.17632v1","updated":"2024-12-23T15:08:08Z","published":"2024-12-23T15:08:08Z","title":"ANID: How Far Are We? 
Evaluating the Discrepancies Between\n AI-synthesized Images and Natural Images through Multimodal Guidance","summary":" In the rapidly evolving field of Artificial Intelligence Generated Content\n(AIGC), one of the key challenges is distinguishing AI-synthesized images from\nnatural images. Despite the remarkable capabilities of advanced AI generative\nmodels in producing visually compelling images, significant discrepancies\nremain when these images are compared to natural ones. To systematically\ninvestigate and quantify these discrepancies, we introduce an AI-Natural Image\nDiscrepancy Evaluation benchmark aimed at addressing the critical question:\n\\textit{how far are AI-generated images (AIGIs) from truly realistic images?}\nWe have constructed a large-scale multimodal dataset, the Distinguishing\nNatural and AI-generated Images (DNAI) dataset, which includes over 440,000\nAIGI samples generated by 8 representative models using both unimodal and\nmultimodal prompts, such as Text-to-Image (T2I), Image-to-Image (I2I), and Text\n\\textit{vs.} Image-to-Image (TI2I). Our fine-grained assessment framework\nprovides a comprehensive evaluation of the DNAI dataset across five key\ndimensions: naive visual feature quality, semantic alignment in multimodal\ngeneration, aesthetic appeal, downstream task applicability, and coordinated\nhuman validation. Extensive evaluation results highlight significant\ndiscrepancies across these dimensions, underscoring the necessity of aligning\nquantitative metrics with human judgment to achieve a holistic understanding of\nAI-generated image quality. 
Code is available at\n\\href{https://github.com/ryliu68/ANID}{https://github.com/ryliu68/ANID}.\n","authors":["Renyang Liu","Ziyu Lyu","Wei Zhou","See-Kiong Ng"],"pdf_url":"https://arxiv.org/pdf/2412.17632v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.05039v2","updated":"2024-12-23T11:38:01Z","published":"2024-05-08T13:13:02Z","title":"Reviewing Intelligent Cinematography: AI research for camera-based video\n production","summary":" This paper offers the first comprehensive review of artificial intelligence\n(AI) research in the context of real camera content acquisition for\nentertainment purposes and is aimed at both researchers and cinematographers.\nAddressing the lack of review papers in the field of intelligent\ncinematography (IC) and the breadth of related computer vision research, we\npresent a holistic view of the IC landscape while providing technical insight,\nimportant for experts across disciplines. We provide technical background on\ngenerative AI, object detection, automated camera calibration and 3-D content\nacquisition, with references to assist non-technical readers. The application\nsections categorize work in terms of four production types: General Production,\nVirtual Production, Live Production and Aerial Production. Within each\napplication section, we (1) sub-classify work according to research topic and\n(2) describe the trends and challenges relevant to each type of production. In\nthe final chapter, we address the greater scope of IC research and summarize\nthe significant potential of this area to influence the creative industries\nsector. We suggest that work relating to virtual production has the greatest\npotential to impact other mediums of production, driven by the growing interest\nin LED volumes/stages for in-camera virtual effects (ICVFX) and automated 3-D\ncapture for virtual modeling of real world scenes and actors. 
We also address\nethical and legal concerns regarding the use of creative AI that impact on\nartists, actors, technologists and the general public.\n","authors":["Adrian Azzarelli","Nantheera Anantrasirichai","David R Bull"],"pdf_url":"https://arxiv.org/pdf/2405.05039v2.pdf","comment":"For researchers and cinematographers. 43 pages including Table of\n Contents, List of Figures and Tables. We obtained permission to use Figures 5\n and 11. All other Figures have been drawn by us"},{"id":"http://arxiv.org/abs/2412.17477v1","updated":"2024-12-23T11:09:30Z","published":"2024-12-23T11:09:30Z","title":"Predicting Satisfied User and Machine Ratio for Compressed Images: A\n Unified Approach","summary":" Nowadays, high-quality images are pursued by both humans for better viewing\nexperience and by machines for more accurate visual analysis. However, images\nare usually compressed before being consumed, decreasing their quality. It is\nmeaningful to predict the perceptual quality of compressed images for both\nhumans and machines, which guides the optimization for compression. In this\npaper, we propose a unified approach to address this. Specifically, we create a\ndeep learning-based model to predict Satisfied User Ratio (SUR) and Satisfied\nMachine Ratio (SMR) of compressed images simultaneously. We first pre-train a\nfeature extractor network on a large-scale SMR-annotated dataset with human\nperception-related quality labels generated by diverse image quality models,\nwhich simulates the acquisition of SUR labels. Then, we propose an\nMLP-Mixer-based network to predict SUR and SMR by leveraging and fusing the\nextracted multi-layer features. We introduce a Difference Feature Residual\nLearning (DFRL) module to learn more discriminative difference features. We\nfurther use a Multi-Head Attention Aggregation and Pooling (MHAAP) layer to\naggregate difference features and reduce their redundancy. 
Experimental results\nindicate that the proposed model significantly outperforms state-of-the-art SUR\nand SMR prediction methods. Moreover, our joint learning scheme of human and\nmachine perceptual quality prediction tasks is effective at improving the\nperformance of both.\n","authors":["Qi Zhang","Shanshe Wang","Xinfeng Zhang","Siwei Ma","Jingshan Pan","Wen Gao"],"pdf_url":"https://arxiv.org/pdf/2412.17477v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.17415v1","updated":"2024-12-23T09:26:38Z","published":"2024-12-23T09:26:38Z","title":"VidCtx: Context-aware Video Question Answering with Image Models","summary":" To address computational and memory limitations of Large Multimodal Models in\nthe Video Question-Answering task, several recent methods extract textual\nrepresentations per frame (e.g., by captioning) and feed them to a Large\nLanguage Model (LLM) that processes them to produce the final response.\nHowever, in this way, the LLM does not have access to visual information and\noften has to process repetitive textual descriptions of nearby frames. To\naddress those shortcomings, in this paper, we introduce VidCtx, a novel\ntraining-free VideoQA framework which integrates both modalities, i.e. both\nvisual information from input frames and textual descriptions of other frames\nthat give the appropriate context. More specifically, in the proposed framework\na pre-trained Large Multimodal Model (LMM) is prompted to extract at regular\nintervals, question-aware textual descriptions (captions) of video frames.\nThose will be used as context when the same LMM will be prompted to answer the\nquestion at hand given as input a) a certain frame, b) the question and c) the\ncontext/caption of an appropriate frame. To avoid redundant information, we\nchose as context the descriptions of distant frames. 
Finally, a simple yet\neffective max pooling mechanism is used to aggregate the frame-level decisions.\nThis methodology enables the model to focus on the relevant segments of the\nvideo and scale to a high number of frames. Experiments show that VidCtx\nachieves competitive performance among approaches that rely on open models on\nthree public Video QA benchmarks, NExT-QA, IntentQA and STAR.\n","authors":["Andreas Goulas","Vasileios Mezaris","Ioannis Patras"],"pdf_url":"https://arxiv.org/pdf/2412.17415v1.pdf","comment":"Submitted for publication"},{"id":"http://arxiv.org/abs/2408.03001v2","updated":"2024-12-23T09:03:02Z","published":"2024-08-06T07:19:51Z","title":"One Framework to Rule Them All: Unifying Multimodal Tasks with LLM\n Neural-Tuning","summary":" Large-scale models have exhibited remarkable capabilities across diverse\ndomains, including automated medical services and intelligent customer support.\nHowever, as most large models are trained on single-modality corpora, enabling\nthem to effectively process and understand multimodal signals remains a\nsignificant challenge. Current research often focuses on designing\ntask-specific or scenario-specific tuning strategies, which limits the\nscalability and versatility. To address this limitation, we propose a unified\nframework that concurrently handles multiple tasks and modalities. In this\nframework, all modalities and tasks are represented as unified tokens and\ntrained using a single, consistent approach. To enable efficient multitask\nprocessing, we introduce a novel tuning strategy termed neural tuning, inspired\nby the concept of sparse distributed representation in the human brain, where\nonly specific subsets of neurons are activated for each task. 
Furthermore, to\nadvance research in multimodal and multitask learning, we present a new\nbenchmark, MMUD, which includes samples annotated with multiple task labels\nspanning reasoning segmentation, referring segmentation, image captioning, and\ntext-to-image generation. By applying neural tuning to pretrained large models\non the MMUD benchmark, we demonstrate the ability to handle multiple tasks\nsimultaneously in a streamlined and efficient manner. All models, code, and\ndatasets will be released publicly upon publication, fostering further research\nand innovation in this field.\n","authors":["Hao Sun","Yu Song","Jiaqing Liu","Jihong Hu","Yen-Wei Chen","Lanfen Lin"],"pdf_url":"https://arxiv.org/pdf/2408.03001v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2303.10446v3","updated":"2024-12-23T06:55:56Z","published":"2023-03-18T16:09:10Z","title":"Content Adaptive Front End For Audio Classification","summary":" We propose a learnable content adaptive front end for audio signal\nprocessing. Before the modern advent of deep learning, we used fixed\nrepresentation non-learnable front-ends like spectrogram or mel-spectrogram\nwith/without neural architectures. With convolutional architectures supporting\nvarious applications such as ASR and acoustic scene understanding, a shift to a\nlearnable front ends occurred in which both the type of basis functions and the\nweight were learned from scratch and optimized for the particular task of\ninterest. With the shift to transformer-based architectures with no\nconvolutional blocks present, a linear layer projects small waveform patches\nonto a small latent dimension before feeding them to a transformer\narchitecture. In this work, we propose a way of computing a content-adaptive\nlearnable time-frequency representation. We pass each audio signal through a\nbank of convolutional filters, each giving a fixed-dimensional vector. 
It is\nakin to learning a bank of finite impulse-response filterbanks and passing the\ninput signal through the optimum filter bank depending on the content of the\ninput signal. A content-adaptive learnable time-frequency representation may be\nmore broadly applicable, beyond the experiments in this paper.\n","authors":["Prateek Verma","Chris Chafe"],"pdf_url":"https://arxiv.org/pdf/2303.10446v3.pdf","comment":"5 pages, 4 figures. 2023 IEEE International Conference on Acoustics,\n Speech, and Signal Processing, Rhodes, Greece; Minor Edits"},{"id":"http://arxiv.org/abs/2412.17238v1","updated":"2024-12-23T03:17:21Z","published":"2024-12-23T03:17:21Z","title":"Modality-Aware Shot Relating and Comparing for Video Scene Detection","summary":" Video scene detection involves assessing whether each shot and its\nsurroundings belong to the same scene. Achieving this requires meticulously\ncorrelating multi-modal cues, $\\it{e.g.}$ visual entity and place modalities,\namong shots and comparing semantic changes around each shot. However, most\nmethods treat multi-modal semantics equally and do not examine contextual\ndifferences between the two sides of a shot, leading to sub-optimal detection\nperformance. In this paper, we propose the $\\bf{M}$odality-$\\bf{A}$ware\n$\\bf{S}$hot $\\bf{R}$elating and $\\bf{C}$omparing approach (MASRC), which\nenables relating shots per their own characteristics of visual entity and place\nmodalities, as well as comparing multi-shots similarities to have scene changes\nexplicitly encoded. Specifically, to fully harness the potential of visual\nentity and place modalities in modeling shot relations, we mine long-term shot\ncorrelations from entity semantics while simultaneously revealing short-term\nshot correlations from place semantics. In this way, we can learn distinctive\nshot features that consolidate coherence within scenes and amplify\ndistinguishability across scenes. 
Once equipped with distinctive shot features,\nwe further encode the relations between preceding and succeeding shots of each\ntarget shot by similarity convolution, aiding in the identification of scene\nending shots. We validate the broad applicability of the proposed components in\nMASRC. Extensive experimental results on public benchmark datasets demonstrate\nthat the proposed MASRC significantly advances video scene detection.\n","authors":["Jiawei Tan","Hongxing Wang","Kang Dang","Jiaxin Li","Zhilong Ou"],"pdf_url":"https://arxiv.org/pdf/2412.17238v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.15509v2","updated":"2024-12-23T03:04:33Z","published":"2024-12-20T02:45:37Z","title":"PolySmart @ TRECVid 2024 Video-To-Text","summary":" In this paper, we present our methods and results for the Video-To-Text (VTT)\ntask at TRECVid 2024, exploring the capabilities of Vision-Language Models\n(VLMs) like LLaVA and LLaVA-NeXT-Video in generating natural language\ndescriptions for video content. We investigate the impact of fine-tuning VLMs\non VTT datasets to enhance description accuracy, contextual relevance, and\nlinguistic consistency. 
Our analysis reveals that fine-tuning substantially\nimproves the model's ability to produce more detailed and domain-aligned text,\nbridging the gap between generic VLM tasks and the specialized needs of VTT.\nExperimental results demonstrate that our fine-tuned model outperforms baseline\nVLMs across various evaluation metrics, underscoring the importance of\ndomain-specific tuning for complex VTT tasks.\n","authors":["Jiaxin Wu","Wengyu Zhang","Xiao-Yong Wei","Qing Li"],"pdf_url":"https://arxiv.org/pdf/2412.15509v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.05831v2","updated":"2024-12-23T02:52:36Z","published":"2024-12-08T06:37:27Z","title":"Semi-Supervised Contrastive Learning for Controllable Video-to-Music\n Retrieval","summary":" Content creators often use music to enhance their videos, from soundtracks in\nmovies to background music in video blogs and social media content. However,\nidentifying the best music for a video can be a difficult and time-consuming\ntask. To address this challenge, we propose a novel framework for automatically\nretrieving a matching music clip for a given video, and vice versa. Our\napproach leverages annotated music labels, as well as the inherent artistic\ncorrespondence between visual and music elements. Distinct from previous\ncross-modal music retrieval works, our method combines both self-supervised and\nsupervised training objectives. We use self-supervised and label-supervised\ncontrastive learning to train a joint embedding space between music and video.\nWe show the effectiveness of our approach by using music genre labels for the\nsupervised training component, and our framework can be generalized to other\nmusic annotations (e.g., emotion, instrument, etc.). Furthermore, our method\nenables fine-grained control over how much the retrieval process focuses on\nself-supervised vs. label information at inference time. 
We evaluate the\nlearned embeddings through a variety of video-to-music and music-to-video\nretrieval tasks. Our experiments show that the proposed approach successfully\ncombines self-supervised and supervised objectives and is effective for\ncontrollable music-video retrieval.\n","authors":["Shanti Stewart","Gouthaman KV","Lie Lu","Andrea Fanelli"],"pdf_url":"https://arxiv.org/pdf/2412.05831v2.pdf","comment":"Accepted at ICASSP 2025"}]},"2024-12-22T00:00:00Z":{"Information Retrieval":[{"id":"http://arxiv.org/abs/2412.17171v1","updated":"2024-12-22T21:56:15Z","published":"2024-12-22T21:56:15Z","title":"Enhancing Item Tokenization for Generative Recommendation through\n Self-Improvement","summary":" Generative recommendation systems, driven by large language models (LLMs),\npresent an innovative approach to predicting user preferences by modeling items\nas token sequences and generating recommendations in a generative manner. A\ncritical challenge in this approach is the effective tokenization of items,\nensuring that they are represented in a form compatible with LLMs. Current item\ntokenization methods include using text descriptions, numerical strings, or\nsequences of discrete tokens. While text-based representations integrate\nseamlessly with LLM tokenization, they are often too lengthy, leading to\ninefficiencies and complicating accurate generation. Numerical strings, while\nconcise, lack semantic depth and fail to capture meaningful item relationships.\nTokenizing items as sequences of newly defined tokens has gained traction, but\nit often requires external models or algorithms for token assignment. These\nexternal processes may not align with the LLM's internal pretrained\ntokenization schema, leading to inconsistencies and reduced model performance.\nTo address these limitations, we propose a self-improving item tokenization\nmethod that allows the LLM to refine its own item tokenizations during training\nprocess. 
Our approach starts with item tokenizations generated by any external\nmodel and periodically adjusts these tokenizations based on the LLM's learned\npatterns. Such alignment process ensures consistency between the tokenization\nand the LLM's internal understanding of the items, leading to more accurate\nrecommendations. Furthermore, our method is simple to implement and can be\nintegrated as a plug-and-play enhancement into existing generative\nrecommendation systems. Experimental results on multiple datasets and using\nvarious initial tokenization strategies demonstrate the effectiveness of our\nmethod, with an average improvement of 8\\% in recommendation performance.\n","authors":["Runjin Chen","Mingxuan Ju","Ngoc Bui","Dimosthenis Antypas","Stanley Cai","Xiaopeng Wu","Leonardo Neves","Zhangyang Wang","Neil Shah","Tong Zhao"],"pdf_url":"https://arxiv.org/pdf/2412.17171v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.17156v1","updated":"2024-12-22T20:45:15Z","published":"2024-12-22T20:45:15Z","title":"LLM-based relevance assessment still can't replace human relevance\n assessment","summary":" The use of large language models (LLMs) for relevance assessment in\ninformation retrieval has gained significant attention, with recent studies\nsuggesting that LLM-based judgments provide comparable evaluations to human\njudgments. Notably, based on TREC 2024 data, Upadhyay et al. make a bold claim\nthat LLM-based relevance assessments, such as those generated by the UMBRELA\nsystem, can fully replace traditional human relevance assessments in TREC-style\nevaluations. This paper critically examines this claim, highlighting practical\nand theoretical limitations that undermine the validity of this conclusion.\nFirst, we question whether the evidence provided by Upadhyay et al. really\nsupports their claim, particularly if a test collection is used as a benchmark\nfor future improvements. 
Second, through a submission deliberately intended to\ndo so, we demonstrate the ease with which automatic evaluation metrics can be\nsubverted, showing that systems designed to exploit these evaluations can\nachieve artificially high scores. Theoretical challenges -- such as the\ninherent narcissism of LLMs, the risk of overfitting to LLM-based metrics, and\nthe potential degradation of future LLM performance -- must be addressed before\nLLM-based relevance assessments can be considered a viable replacement for\nhuman judgments.\n","authors":["Charles L. A. Clarke","Laura Dietz"],"pdf_url":"https://arxiv.org/pdf/2412.17156v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2404.18043v2","updated":"2024-12-22T20:31:15Z","published":"2024-04-28T01:38:38Z","title":"Utilizing Large Language Models for Information Extraction from Real\n Estate Transactions","summary":" Real estate sales contracts contain crucial information for property\ntransactions, but manual data extraction can be time-consuming and error-prone.\nThis paper explores the application of large language models, specifically\ntransformer-based architectures, for automated information extraction from real\nestate contracts. We discuss challenges, techniques, and future directions in\nleveraging these models to improve efficiency and accuracy in real estate\ncontract analysis. 
We generated synthetic contracts using the real-world\ntransaction dataset, thereby fine-tuning the large-language model and achieving\nsignificant metrics improvements and qualitative improvements in information\nretrieval and reasoning tasks.\n","authors":["Yu Zhao","Haoxiang Gao"],"pdf_url":"https://arxiv.org/pdf/2404.18043v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.17075v1","updated":"2024-12-22T15:57:35Z","published":"2024-12-22T15:57:35Z","title":"Iterative NLP Query Refinement for Enhancing Domain-Specific Information\n Retrieval: A Case Study in Career Services","summary":" Retrieving semantically relevant documents in niche domains poses significant\nchallenges for traditional TF-IDF-based systems, often resulting in low\nsimilarity scores and suboptimal retrieval performance. This paper addresses\nthese challenges by introducing an iterative and semi-automated query\nrefinement methodology tailored to Humber College's career services webpages.\nInitially, generic queries related to interview preparation yield low\ntop-document similarities (approximately 0.2--0.3). To enhance retrieval\neffectiveness, we implement a two-fold approach: first, domain-aware query\nrefinement by incorporating specialized terms such as\nresources-online-learning, student-online-services, and career-advising;\nsecond, the integration of structured educational descriptors like \"online\nresume and interview improvement tools.\" Additionally, we automate the\nextraction of domain-specific keywords from top-ranked documents to suggest\nrelevant terms for query expansion. Through experiments conducted on five\nbaseline queries, our semi-automated iterative refinement process elevates the\naverage top similarity score from approximately 0.18 to 0.42, marking a\nsubstantial improvement in retrieval performance. 
The implementation details,\nincluding reproducible code and experimental setups, are made available in our\nGitHub repositories \\url{https://github.com/Elipei88/HumberChatbotBackend} and\n\\url{https://github.com/Nisarg851/HumberChatbot}. We also discuss the\nlimitations of our approach and propose future directions, including the\nintegration of advanced neural retrieval models.\n","authors":["Elham Peimani","Gurpreet Singh","Nisarg Mahyavanshi","Aman Arora","Awais Shaikh"],"pdf_url":"https://arxiv.org/pdf/2412.17075v1.pdf","comment":"To be submitted to CoLM 2025"},{"id":"http://arxiv.org/abs/2408.14393v2","updated":"2024-12-22T13:40:24Z","published":"2024-08-26T16:21:50Z","title":"CURE4Rec: A Benchmark for Recommendation Unlearning with Deeper\n Influence","summary":" With increasing privacy concerns in artificial intelligence, regulations have\nmandated the right to be forgotten, granting individuals the right to withdraw\ntheir data from models. Machine unlearning has emerged as a potential solution\nto enable selective forgetting in models, particularly in recommender systems\nwhere historical data contains sensitive user information. Despite recent\nadvances in recommendation unlearning, evaluating unlearning methods\ncomprehensively remains challenging due to the absence of a unified evaluation\nframework and overlooked aspects of deeper influence, e.g., fairness. To\naddress these gaps, we propose CURE4Rec, the first comprehensive benchmark for\nrecommendation unlearning evaluation. CURE4Rec covers four aspects, i.e.,\nunlearning Completeness, recommendation Utility, unleaRning efficiency, and\nrecommendation fairnEss, under three data selection strategies, i.e., core\ndata, edge data, and random data. Specifically, we consider the deeper\ninfluence of unlearning on recommendation fairness and robustness towards data\nwith varying impact levels. 
We construct multiple datasets with CURE4Rec\nevaluation and conduct extensive experiments on existing recommendation\nunlearning methods. Our code is released at\nhttps://github.com/xiye7lai/CURE4Rec.\n","authors":["Chaochao Chen","Jiaming Zhang","Yizhao Zhang","Li Zhang","Lingjuan Lyu","Yuyuan Li","Biao Gong","Chenggang Yan"],"pdf_url":"https://arxiv.org/pdf/2408.14393v2.pdf","comment":"Accepted to NeurIPS 2024, Datasets and Benchmarks. Website:\n https://oktton.github.io"},{"id":"http://arxiv.org/abs/2412.16984v1","updated":"2024-12-22T12:00:04Z","published":"2024-12-22T12:00:04Z","title":"LLM-Powered User Simulator for Recommender System","summary":" User simulators can rapidly generate a large volume of timely user behavior\ndata, providing a testing platform for reinforcement learning-based recommender\nsystems, thus accelerating their iteration and optimization. However, prevalent\nuser simulators generally suffer from significant limitations, including the\nopacity of user preference modeling and the incapability of evaluating\nsimulation accuracy. In this paper, we introduce an LLM-powered user simulator\nto simulate user engagement with items in an explicit manner, thereby enhancing\nthe efficiency and effectiveness of reinforcement learning-based recommender\nsystems training. Specifically, we identify the explicit logic of user\npreferences, leverage LLMs to analyze item characteristics and distill user\nsentiments, and design a logical model to imitate real human engagement. By\nintegrating a statistical model, we further enhance the reliability of the\nsimulation, proposing an ensemble model that synergizes logical and statistical\ninsights for user interaction simulations. Capitalizing on the extensive\nknowledge and semantic generation capabilities of LLMs, our user simulator\nfaithfully emulates user behaviors and preferences, yielding high-fidelity\ntraining data that enrich the training of recommendation algorithms. 
We\nestablish quantifying and qualifying experiments on five datasets to validate\nthe simulator's effectiveness and stability across various recommendation\nscenarios.\n","authors":["Zijian Zhang","Shuchang Liu","Ziru Liu","Rui Zhong","Qingpeng Cai","Xiangyu Zhao","Chunxu Zhang","Qidong Liu","Peng Jiang"],"pdf_url":"https://arxiv.org/pdf/2412.16984v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.16969v1","updated":"2024-12-22T11:00:00Z","published":"2024-12-22T11:00:00Z","title":"Multifaceted User Modeling in Recommendation: A Federated Foundation\n Models Approach","summary":" Multifaceted user modeling aims to uncover fine-grained patterns and learn\nrepresentations from user data, revealing their diverse interests and\ncharacteristics, such as profile, preference, and personality. Recent studies\non foundation model-based recommendation have emphasized the Transformer\narchitecture's remarkable ability to capture complex, non-linear user-item\ninteraction relationships. This paper aims to advance foundation model-based\nrecommender systems by introducing enhancements to multifaceted user modeling\ncapabilities. We propose a novel Transformer layer designed specifically for\nrecommendation, using the self-attention mechanism to capture sequential\nuser-item interaction patterns. Specifically, we design a group gating network\nto identify user groups, enabling hierarchical discovery across different\nlayers, thereby capturing the multifaceted nature of user interests through\nmultiple Transformer layers. Furthermore, to broaden the data scope and further\nenhance multifaceted user modeling, we extend the framework to a federated\nsetting, enabling the use of private datasets while ensuring privacy.\nExperimental validations on benchmark datasets demonstrate the superior\nperformance of our proposed method. 
Code is available.\n","authors":["Chunxu Zhang","Guodong Long","Hongkuan Guo","Zhaojie Liu","Guorui Zhou","Zijian Zhang","Yang Liu","Bo Yang"],"pdf_url":"https://arxiv.org/pdf/2412.16969v1.pdf","comment":"Accepted as a regular paper of AAAI25"},{"id":"http://arxiv.org/abs/2412.16933v1","updated":"2024-12-22T09:08:46Z","published":"2024-12-22T09:08:46Z","title":"Towards a Unified Paradigm: Integrating Recommendation Systems as a New\n Language in Large Models","summary":" This paper explores the use of Large Language Models (LLMs) for sequential\nrecommendation, which predicts users' future interactions based on their past\nbehavior. We introduce a new concept, \"Integrating Recommendation Systems as a\nNew Language in Large Models\" (RSLLM), which combines the strengths of\ntraditional recommenders and LLMs. RSLLM uses a unique prompting method that\ncombines ID-based item embeddings from conventional recommendation models with\ntextual item features. It treats users' sequential behaviors as a distinct\nlanguage and aligns the ID embeddings with the LLM's input space using a\nprojector. We also propose a two-stage LLM fine-tuning framework that refines a\npretrained LLM using a combination of two contrastive losses and a language\nmodeling loss. The LLM is first fine-tuned using text-only prompts, followed by\ntarget domain fine-tuning with unified prompts. This trains the model to\nincorporate behavioral knowledge from the traditional sequential recommender\ninto the LLM. 
Our empirical results validate the effectiveness of our proposed\nframework.\n","authors":["Kai Zheng","Qingfeng Sun","Can Xu","Peng Yu","Qingwei Guo"],"pdf_url":"https://arxiv.org/pdf/2412.16933v1.pdf","comment":"13 pages, 5 figures"},{"id":"http://arxiv.org/abs/2412.16922v1","updated":"2024-12-22T08:46:16Z","published":"2024-12-22T08:46:16Z","title":"Enhancing Supply Chain Transparency in Emerging Economies Using Online\n Contents and LLMs","summary":" In the current global economy, supply chain transparency plays a pivotal role\nin ensuring this security by enabling companies to monitor supplier performance\nand fostering accountability and responsibility. Despite the advancements in\nsupply chain relationship datasets like Bloomberg and FactSet, supply chain\ntransparency remains a significant challenge in emerging economies due to\nissues such as information asymmetry and institutional gaps in regulation. This\nstudy proposes a novel approach to enhance supply chain transparency in\nemerging economies by leveraging online content and large language models\n(LLMs). We develop a Supply Chain Knowledge Graph Mining System that integrates\nadvanced LLMs with web crawler technology to automatically collect and analyze\nsupply chain information. The system's effectiveness is validated through a\ncase study focusing on the semiconductor supply chain, a domain that has\nrecently gained significant attention due to supply chain risks. Our results\ndemonstrate that the proposed system provides greater applicability for\nemerging economies, such as mainland China, complementing the data gaps in\nexisting datasets. 
However, challenges including the accurate estimation of\nmonetary and material flows, the handling of time series data, synonym\ndisambiguation, and mitigating biases from online content still remain.\nFuture research should focus on addressing these issues to further enhance the\nsystem's capabilities and broaden its application to other emerging economies\nand industries.\n","authors":["Bohan Jin","Qianyou Sun","Lihua Chen"],"pdf_url":"https://arxiv.org/pdf/2412.16922v1.pdf","comment":"6 pages"},{"id":"http://arxiv.org/abs/2412.13825v2","updated":"2024-12-22T06:42:21Z","published":"2024-12-18T13:12:36Z","title":"MixRec: Heterogeneous Graph Collaborative Filtering","summary":" For modern recommender systems, the use of low-dimensional latent\nrepresentations to embed users and items based on their observed interactions\nhas become commonplace. However, many existing recommendation models are\nprimarily designed for coarse-grained and homogeneous interactions, which\nlimits their effectiveness in two critical dimensions. Firstly, these models\nfail to leverage the relational dependencies that exist across different types\nof user behaviors, such as page views, collects, comments, and purchases.\nSecondly, they struggle to capture the fine-grained latent factors that drive\nuser interaction patterns. To address these limitations, we present a\nheterogeneous graph collaborative filtering model MixRec that excels at\ndisentangling users' multi-behavior interaction patterns and uncovering the\nlatent intent factors behind each behavior. Our model achieves this by\nincorporating intent disentanglement and multi-behavior modeling, facilitated\nby a parameterized heterogeneous hypergraph architecture. Furthermore, we\nintroduce a novel contrastive learning paradigm that adaptively explores the\nadvantages of self-supervised data augmentation, thereby enhancing the model's\nresilience against data sparsity and expressiveness with relation\nheterogeneity. 
To validate the efficacy of MixRec, we conducted extensive\nexperiments on three public datasets. The results clearly demonstrate its\nsuperior performance, significantly outperforming various state-of-the-art\nbaselines. Our model is open-sourced and available at:\nhttps://github.com/HKUDS/MixRec.\n","authors":["Lianghao Xia","Meiyan Xie","Yong Xu","Chao Huang"],"pdf_url":"https://arxiv.org/pdf/2412.13825v2.pdf","comment":"This paper is accepted by WSDM'2025"},{"id":"http://arxiv.org/abs/2412.16855v1","updated":"2024-12-22T04:40:24Z","published":"2024-12-22T04:40:24Z","title":"GME: Improving Universal Multimodal Retrieval by Multimodal LLMs","summary":" Universal Multimodal Retrieval (UMR) aims to enable search across various\nmodalities using a unified model, where queries and candidates can consist of\npure text, images, or a combination of both. Previous work has attempted to\nadopt multimodal large language models (MLLMs) to realize UMR using only text\ndata. However, our preliminary experiments demonstrate that more diverse\nmultimodal training data can further unlock the potential of MLLMs. Despite its\neffectiveness, the existing multimodal training data is highly imbalanced in\nterms of modality, which motivates us to develop a training data synthesis\npipeline and construct a large-scale, high-quality fused-modal training\ndataset. Based on the synthetic training data, we develop the General\nMultimodal Embedder (GME), an MLLM-based dense retriever designed for UMR.\nFurthermore, we construct a comprehensive UMR Benchmark (UMRB) to evaluate the\neffectiveness of our approach. Experimental results show that our method\nachieves state-of-the-art performance among existing UMR methods. 
Last, we\nprovide in-depth analyses of model scaling, training strategies, and perform\nablation studies on both the model and synthetic data.\n","authors":["Xin Zhang","Yanzhao Zhang","Wen Xie","Mingxin Li","Ziqi Dai","Dingkun Long","Pengjun Xie","Meishan Zhang","Wenjie Li","Min Zhang"],"pdf_url":"https://arxiv.org/pdf/2412.16855v1.pdf","comment":"32 pages, models at\n https://huggingface.co/Alibaba-NLP/gme-Qwen2-VL-2B-Instruct"},{"id":"http://arxiv.org/abs/2412.17872v1","updated":"2024-12-22T03:16:49Z","published":"2024-12-22T03:16:49Z","title":"Joint Knowledge Editing for Information Enrichment and Probability\n Promotion","summary":" Knowledge stored in large language models requires timely updates to reflect\nthe dynamic nature of real-world information. To update the knowledge, most\nknowledge editing methods focus on the low layers, since recent probes into the\nknowledge recall process reveal that the answer information is enriched in low\nlayers. However, these probes only and could only reveal critical recall stages\nfor the original answers, while the goal of editing is to rectify model's\nprediction for the target answers. This inconsistency indicates that both the\nprobe approaches and the associated editing methods are deficient. To mitigate\nthe inconsistency and identify critical editing regions, we propose a\ncontrast-based probe approach, and locate two crucial stages where the model\nbehavior diverges between the original and target answers: Information\nEnrichment in low layers and Probability Promotion in high layers. Building\nupon the insights, we develop the Joint knowledge Editing for information\nEnrichment and probability Promotion (JEEP) method, which jointly edits both\nthe low and high layers to modify the two critical recall stages. Considering\nthe mutual interference and growing forgetting due to dual modifications, JEEP\nis designed to ensure that updates to distinct regions share the same\nobjectives and are complementary. 
We rigorously evaluate JEEP by editing up to\nthousands of facts on various models, i.e., GPT-J (6B) and LLaMA (7B), and\naddressing diverse editing objectives, i.e., adding factual and counterfactual\nknowledge. In all tested scenarios, JEEP achieves best performances, validating\nthe effectiveness of the revealings of our probe approach and the designs of\nour editing method. Our code and data are available at\nhttps://github.com/Eric8932/JEEP.\n","authors":["Wenhang Shi","Yiren Chen","Shuqing Bian","Xinyi Zhang","Zhe Zhao","Pengfei Hu","Wei Lu","Xiaoyong Du"],"pdf_url":"https://arxiv.org/pdf/2412.17872v1.pdf","comment":null}],"Multimedia":[{"id":"http://arxiv.org/abs/2408.10397v2","updated":"2024-12-22T19:35:34Z","published":"2024-08-19T20:28:39Z","title":"Webcam-based Pupil Diameter Prediction Benefits from Upscaling","summary":" Capturing pupil diameter is essential for assessing psychological and\nphysiological states such as stress levels and cognitive load. However, the low\nresolution of images in eye datasets often hampers precise measurement. This\nstudy evaluates the impact of various upscaling methods, ranging from bicubic\ninterpolation to advanced super-resolution, on pupil diameter predictions. We\ncompare several pre-trained methods, including CodeFormer, GFPGAN, Real-ESRGAN,\nHAT, and SRResNet. Our findings suggest that pupil diameter prediction models\ntrained on upscaled datasets are highly sensitive to the selected upscaling\nmethod and scale. Our results demonstrate that upscaling methods consistently\nenhance the accuracy of pupil diameter prediction models, highlighting the\nimportance of upscaling in pupilometry. Overall, our work provides valuable\ninsights for selecting upscaling techniques, paving the way for more accurate\nassessments in psychological and physiological research.\n","authors":["Vijul Shah","Brian B. 
Moser","Ko Watanabe","Andreas Dengel"],"pdf_url":"https://arxiv.org/pdf/2408.10397v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.00401v2","updated":"2024-12-22T17:32:54Z","published":"2023-12-01T07:50:53Z","title":"VIoTGPT: Learning to Schedule Vision Tools in LLMs towards Intelligent\n Video Internet of Things","summary":" Video Internet of Things (VIoT) has shown full potential in collecting an\nunprecedented volume of video data. How to schedule the domain-specific\nperceiving models and analyze the collected videos uniformly, efficiently, and\nespecially intelligently to accomplish complicated tasks is challenging. To\naddress the challenge, we build VIoTGPT, the framework based on LLMs to\ncorrectly interact with humans, query knowledge videos, and invoke vision\nmodels to analyze multimedia data collaboratively. To support VIoTGPT and\nrelated future works, we meticulously crafted the VIoT-Tool dataset, including\nthe training dataset and the benchmark involving 11 representative vision\nmodels across three categories based on semi-automatic annotations. To guide\nLLM to act as the intelligent agent towards intelligent VIoT, we resort to the\nReAct instruction tuning method based on VIoT-Tool to learn the tool\ncapability. Quantitative and qualitative experiments and analyses demonstrate\nthe effectiveness of VIoTGPT. We believe VIoTGPT contributes to improving\nhuman-centered experiences in VIoT applications. 
The project website is\nhttps://github.com/zhongyy/VIoTGPT.\n","authors":["Yaoyao Zhong","Mengshi Qi","Rui Wang","Yuhan Qiu","Yang Zhang","Huadong Ma"],"pdf_url":"https://arxiv.org/pdf/2312.00401v2.pdf","comment":"AAAI 2025, 12 pages"},{"id":"http://arxiv.org/abs/2412.17049v1","updated":"2024-12-22T15:00:16Z","published":"2024-12-22T15:00:16Z","title":"Modular Conversational Agents for Surveys and Interviews","summary":" Surveys and interviews (structured, semi-structured, or unstructured) are\nwidely used for collecting insights on emerging or hypothetical scenarios.\nTraditional human-led methods often face challenges related to cost,\nscalability, and consistency. Recently, various domains have begun to explore\nthe use of conversational agents (chatbots) powered by large language models\n(LLMs). However, as public investments and policies on infrastructure and\nservices often involve substantial public stakes and environmental risks, there\nis a need for a rigorous, transparent, privacy-preserving, and cost-efficient\ndevelopment framework tailored for such major decision-making processes. This\npaper addresses this gap by introducing a modular approach and its resultant\nparameterized process for designing conversational agents. We detail the system\narchitecture, integrating engineered prompts, specialized knowledge bases, and\ncustomizable, goal-oriented conversational logic in the proposed approach. 
We\ndemonstrate the adaptability, generalizability, and efficacy of our modular\napproach through three empirical studies: (1) travel preference surveys,\nhighlighting multimodal (voice, text, and image generation) capabilities; (2)\npublic opinion elicitation on a newly constructed, novel infrastructure\nproject, showcasing question customization and multilingual (English and\nFrench) capabilities; and (3) transportation expert consultation about future\ntransportation systems, highlighting real-time, clarification request\ncapabilities for open-ended questions, resilience in handling erratic inputs,\nand efficient transcript post-processing. The results show the effectiveness of\nthis modular approach and how it addresses key ethical, privacy, security, and\ntoken consumption concerns, setting the stage for the next-generation surveys\nand interviews.\n","authors":["Jiangbo Yu","Jinhua Zhao","Luis Miranda-Moreno","Matthew Korp"],"pdf_url":"https://arxiv.org/pdf/2412.17049v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.16982v1","updated":"2024-12-22T11:53:51Z","published":"2024-12-22T11:53:51Z","title":"InterDance:Reactive 3D Dance Generation with Realistic Duet Interactions","summary":" Humans perform a variety of interactive motions, among which duet dance is\none of the most challenging interactions. However, in terms of human motion\ngenerative models, existing works are still unable to generate high-quality\ninteractive motions, especially in the field of duet dance. On the one hand, it\nis due to the lack of large-scale high-quality datasets. On the other hand, it\narises from the incomplete representation of interactive motion and the lack of\nfine-grained optimization of interactions. To address these challenges, we\npropose, InterDance, a large-scale duet dance dataset that significantly\nenhances motion quality, data scale, and the variety of dance genres. 
Built\nupon this dataset, we propose a new motion representation that can accurately\nand comprehensively describe interactive motion. We further introduce a\ndiffusion-based framework with an interaction refinement guidance strategy to\noptimize the realism of interactions progressively. Extensive experiments\ndemonstrate the effectiveness of our dataset and algorithm.\n","authors":["Ronghui Li","Youliang Zhang","Yachao Zhang","Yuxiang Zhang","Mingyang Su","Jie Guo","Ziwei Liu","Yebin Liu","Xiu Li"],"pdf_url":"https://arxiv.org/pdf/2412.16982v1.pdf","comment":"https://inter-dance.github.io/"},{"id":"http://arxiv.org/abs/2412.16944v1","updated":"2024-12-22T09:28:06Z","published":"2024-12-22T09:28:06Z","title":"Linguistics-Vision Monotonic Consistent Network for Sign Language\n Production","summary":" Sign Language Production (SLP) aims to generate sign videos corresponding to\nspoken language sentences, where the conversion of sign Glosses to Poses (G2P)\nis the key step. Due to the cross-modal semantic gap and the lack of\nword-action correspondence labels for strong supervision alignment, the SLP\nsuffers huge challenges in linguistics-vision consistency. In this work, we\npropose a Transformer-based Linguistics-Vision Monotonic Consistent Network\n(LVMCN) for SLP, which constrains fine-grained cross-modal monotonic alignment\nand coarse-grained multimodal semantic consistency in language-visual cues\nthrough Cross-modal Semantic Aligner (CSA) and Multimodal Semantic Comparator\n(MSC). In the CSA, we constrain the implicit alignment between corresponding\ngloss and pose sequences by computing the cosine similarity association matrix\nbetween cross-modal feature sequences (i.e., the order consistency of\nfine-grained sign glosses and actions). As for MSC, we construct multimodal\ntriplets based on paired and unpaired samples in batch data. 
By pulling closer\nthe corresponding text-visual pairs and pushing apart the non-corresponding\ntext-visual pairs, we constrain the semantic co-occurrence degree between\ncorresponding gloss and pose sequences (i.e., the semantic consistency of\ncoarse-grained textual sentences and sign videos). Extensive experiments on the\npopular PHOENIX14T benchmark show that the LVMCN outperforms the\nstate-of-the-art.\n","authors":["Xu Wang","Shengeng Tang","Peipei Song","Shuo Wang","Dan Guo","Richang Hong"],"pdf_url":"https://arxiv.org/pdf/2412.16944v1.pdf","comment":"Accepted by ICASSP 2025"},{"id":"http://arxiv.org/abs/2412.16928v1","updated":"2024-12-22T08:58:15Z","published":"2024-12-22T08:58:15Z","title":"AV-DTEC: Self-Supervised Audio-Visual Fusion for Drone Trajectory\n Estimation and Classification","summary":" The increasing use of compact UAVs has created significant threats to public\nsafety, while traditional drone detection systems are often bulky and costly.\nTo address these challenges, we propose AV-DTEC, a lightweight self-supervised\naudio-visual fusion-based anti-UAV system. AV-DTEC is trained using\nself-supervised learning with labels generated by LiDAR, and it simultaneously\nlearns audio and visual features through a parallel selective state-space\nmodel. With the learned features, a specially designed plug-and-play\nprimary-auxiliary feature enhancement module integrates visual features into\naudio features for better robustness in cross-lighting conditions. To reduce\nreliance on auxiliary features and align modalities, we propose a\nteacher-student model that adaptively adjusts the weighting of visual features.\nAV-DTEC demonstrates exceptional accuracy and effectiveness in real-world\nmulti-modality data. 
The code and trained models are publicly accessible on\nGitHub\n \\url{https://github.com/AmazingDay1/AV-DETC}.\n","authors":["Zhenyuan Xiao","Yizhuo Yang","Guili Xu","Xianglong Zeng","Shenghai Yuan"],"pdf_url":"https://arxiv.org/pdf/2412.16928v1.pdf","comment":"Submitted to ICRA 2025"},{"id":"http://arxiv.org/abs/2412.16861v1","updated":"2024-12-22T05:04:17Z","published":"2024-12-22T05:04:17Z","title":"SoundLoc3D: Invisible 3D Sound Source Localization and Classification\n Using a Multimodal RGB-D Acoustic Camera","summary":" Accurately localizing 3D sound sources and estimating their semantic labels\n-- where the sources may not be visible, but are assumed to lie on the physical\nsurface of objects in the scene -- have many real applications, including\ndetecting gas leak and machinery malfunction. The audio-visual weak-correlation\nin such setting poses new challenges in deriving innovative methods to answer\nif or how we can use cross-modal information to solve the task. Towards this\nend, we propose to use an acoustic-camera rig consisting of a pinhole RGB-D\ncamera and a coplanar four-channel microphone array~(Mic-Array). By using this\nrig to record audio-visual signals from multiviews, we can use the cross-modal\ncues to estimate the sound sources 3D locations. Specifically, our framework\nSoundLoc3D treats the task as a set prediction problem, each element in the set\ncorresponds to a potential sound source. Given the audio-visual\nweak-correlation, the set representation is initially learned from a single\nview microphone array signal, and then refined by actively incorporating\nphysical surface cues revealed from multiview RGB-D images. 
We demonstrate the\nefficiency and superiority of SoundLoc3D on large-scale simulated dataset, and\nfurther show its robustness to RGB-D measurement inaccuracy and ambient noise\ninterference.\n","authors":["Yuhang He","Sangyun Shin","Anoop Cherian","Andrew Markham"],"pdf_url":"https://arxiv.org/pdf/2412.16861v1.pdf","comment":"Accepted by WACV2025"}]},"2024-12-21T00:00:00Z":{"Information Retrieval":[{"id":"http://arxiv.org/abs/2411.14100v2","updated":"2024-12-21T19:15:27Z","published":"2024-11-21T13:05:18Z","title":"BEST-STD: Bidirectional Mamba-Enhanced Speech Tokenization for Spoken\n Term Detection","summary":" Spoken term detection (STD) is often hindered by reliance on frame-level\nfeatures and the computationally intensive DTW-based template matching,\nlimiting its practicality. To address these challenges, we propose a novel\napproach that encodes speech into discrete, speaker-agnostic semantic tokens.\nThis facilitates fast retrieval using text-based search algorithms and\neffectively handles out-of-vocabulary terms. Our approach focuses on generating\nconsistent token sequences across varying utterances of the same term. We also\npropose a bidirectional state space modeling within the Mamba encoder, trained\nin a self-supervised learning framework, to learn contextual frame-level\nfeatures that are further encoded into discrete tokens. Our analysis shows that\nour speech tokens exhibit greater speaker invariance than those from existing\ntokenizers, making them more suitable for STD tasks. 
Empirical evaluation on\nLibriSpeech and TIMIT databases indicates that our method outperforms existing\nSTD baselines while being more efficient.\n","authors":["Anup Singh","Kris Demuynck","Vipul Arora"],"pdf_url":"https://arxiv.org/pdf/2411.14100v2.pdf","comment":"Accepted at ICASSP 2025"},{"id":"http://arxiv.org/abs/2412.16708v1","updated":"2024-12-21T17:31:52Z","published":"2024-12-21T17:31:52Z","title":"Towards More Robust Retrieval-Augmented Generation: Evaluating RAG Under\n Adversarial Poisoning Attacks","summary":" Retrieval-Augmented Generation (RAG) systems have emerged as a promising\nsolution to mitigate LLM hallucinations and enhance their performance in\nknowledge-intensive domains. However, these systems are vulnerable to\nadversarial poisoning attacks, where malicious passages injected into retrieval\ndatabases can mislead the model into generating factually incorrect outputs. In\nthis paper, we investigate both the retrieval and the generation components of\nRAG systems to understand how to enhance their robustness against such attacks.\nFrom the retrieval perspective, we analyze why and how the adversarial contexts\nare retrieved and assess how the quality of the retrieved passages impacts\ndownstream generation. From a generation perspective, we evaluate whether LLMs'\nadvanced critical thinking and internal knowledge capabilities can be leveraged\nto mitigate the impact of adversarial contexts, i.e., using skeptical prompting\nas a self-defense mechanism. 
Our experiments and findings provide actionable\ninsights into designing safer and more resilient retrieval-augmented\nframeworks, paving the way for their reliable deployment in real-world\napplications.\n","authors":["Jinyan Su","Jin Peng Zhou","Zhengxin Zhang","Preslav Nakov","Claire Cardie"],"pdf_url":"https://arxiv.org/pdf/2412.16708v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.16701v1","updated":"2024-12-21T16:59:00Z","published":"2024-12-21T16:59:00Z","title":"AlzheimerRAG: Multimodal Retrieval Augmented Generation for PubMed\n articles","summary":" Recent advancements in generative AI have flourished the development of\nhighly adept Large Language Models (LLMs) that integrate diverse data types to\nempower decision-making. Among these, Multimodal Retrieval-Augmented Generation\n(RAG) applications are promising for their capability to combine the strengths\nof information retrieval and generative models, enhancing their utility across\nvarious domains, including biomedical research. This paper introduces\nAlzheimerRAG, a Multimodal RAG pipeline tool for biomedical research use cases,\nprimarily focusing on Alzheimer's disease from PubMed articles. Our pipeline\nincorporates multimodal fusion techniques to integrate textual and visual data\nprocessing by efficiently indexing and accessing vast amounts of biomedical\nliterature. Preliminary experimental results against benchmarks, such as BioASQ\nand PubMedQA, have returned improved results in information retrieval and\nsynthesis of domain-specific information. We also demonstrate a case study with\nour RAG pipeline across different Alzheimer's clinical scenarios. We infer that\nAlzheimerRAG can generate responses with accuracy non-inferior to humans and\nwith low rates of hallucination. 
Overall, a reduction in cognitive task load is\nobserved, which allows researchers to gain multimodal insights, improving\nunderstanding and treatment of Alzheimer's disease.\n","authors":["Aritra Kumar Lahiri","Qinmin Vivian Hu"],"pdf_url":"https://arxiv.org/pdf/2412.16701v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.16694v1","updated":"2024-12-21T16:36:52Z","published":"2024-12-21T16:36:52Z","title":"DragonVerseQA: Open-Domain Long-Form Context-Aware Question-Answering","summary":" This paper proposes a novel approach to develop an open-domain and long-form\nOver-The-Top (OTT) Question-Answering (QA) dataset, DragonVerseQA, specifically\noriented to the fantasy universe of \"House of the Dragon\" and \"Game Of Thrones\"\nTV series. Most existing QA datasets focus on short, fact-based answers sourced\nalmost solely from Wikipedia articles, devoid of depth and contextual richness\nfor sophisticated narrative understanding. We curate a dataset that combines\nfull episode summaries sourced from HBO and fandom wiki websites, user reviews\nfrom sources like IMDb and Rotten Tomatoes, and high-quality, open-domain,\nlegally admissible sources, and structured data from repositories like WikiData\ninto one dataset. The dataset provides a multi-dimensional context, reflecting\ncomplex character dynamics and plot developments from these varied sources.\nThat means, on equal footing, only after heavy data preprocessing and filtering\nmethods will meaningful, non-spam unbiased reviews be available in this\nenriched dataset. The comprehensive insights are given through the long-form\nanswers generated from this enriched context. 
This is what makes this valuable\ndataset for improving conversational AI, narrative analysis, sentiment\nanalysis, summarization techniques, and relation extraction.\n A comparative analysis with state-of-the-art QA datasets such as SQuAD 2.0,\nTriviaQA, and Natural Questions brings to light the unique advantages of our\ndataset in terms of contextual complexity and answer length. Detailed reviews\nadd layers to audience sentiment and narrative interpretation, raising the bar\nfor domain-specific QA with a new quality benchmark. Our work also allows a\ndeeper understanding of entertainment-industry content and opens the door to\nmore knowledgeable and creative AI-driven interactions within digital media\nenvironments.\n","authors":["Aritra Kumar Lahiri","Qinmin Vivian Hu"],"pdf_url":"https://arxiv.org/pdf/2412.16694v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.08231v2","updated":"2024-12-21T15:59:05Z","published":"2024-08-15T15:56:23Z","title":"DaRec: A Disentangled Alignment Framework for Large Language Model and\n Recommender System","summary":" Benefiting from the strong reasoning capabilities, Large language models\n(LLMs) have demonstrated remarkable performance in recommender systems. Various\nefforts have been made to distill knowledge from LLMs to enhance collaborative\nmodels, employing techniques like contrastive learning for representation\nalignment. In this work, we prove that directly aligning the representations of\nLLMs and collaborative models is sub-optimal for enhancing downstream\nrecommendation tasks performance, based on the information theorem.\nConsequently, the challenge of effectively aligning semantic representations\nbetween collaborative models and LLMs remains unresolved. Inspired by this\nviewpoint, we propose a novel plug-and-play alignment framework for LLMs and\ncollaborative models. 
Specifically, we first disentangle the latent\nrepresentations of both LLMs and collaborative models into specific and shared\ncomponents via projection layers and representation regularization.\nSubsequently, we perform both global and local structure alignment on the\nshared representations to facilitate knowledge transfer. Additionally, we\ntheoretically prove that the specific and shared representations contain more\npertinent and less irrelevant information, which can enhance the effectiveness\nof downstream recommendation tasks. Extensive experimental results on benchmark\ndatasets demonstrate that our method is superior to existing state-of-the-art\nalgorithms.\n","authors":["Xihong Yang","Heming Jing","Zixing Zhang","Jindong Wang","Huakang Niu","Shuaiqiang Wang","Yu Lu","Junfeng Wang","Dawei Yin","Xinwang Liu","En Zhu","Defu Lian","Erxue Min"],"pdf_url":"https://arxiv.org/pdf/2408.08231v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.16615v1","updated":"2024-12-21T13:19:15Z","published":"2024-12-21T13:19:15Z","title":"Large Language Model Can Be a Foundation for Hidden Rationale-Based\n Retrieval","summary":" Despite the recent advancement in Retrieval-Augmented Generation (RAG)\nsystems, most retrieval methodologies are often developed for factual\nretrieval, which assumes query and positive documents are semantically similar.\nIn this paper, we instead propose and study a more challenging type of\nretrieval task, called hidden rationale retrieval, in which query and document\nare not similar but can be inferred by reasoning chains, logic relationships,\nor empirical experiences. To address such problems, an instruction-tuned Large\nlanguage model (LLM) with a cross-encoder architecture could be a reasonable\nchoice. To further strengthen pioneering LLM-based retrievers, we design a\nspecial instruction that transforms the retrieval task into a generative task\nby prompting LLM to answer a binary-choice question. 
The model can be\nfine-tuned with direct preference optimization (DPO). The framework is also\noptimized for computational efficiency with no performance degradation. We name\nthis retrieval framework by RaHoRe and verify its zero-shot and fine-tuned\nperformance superiority on Emotional Support Conversation (ESC), compared with\nprevious retrieval works. Our study suggests the potential to employ LLM as a\nfoundation for a wider scope of retrieval tasks. Our codes, models, and\ndatasets are available on https://github.com/flyfree5/LaHoRe.\n","authors":["Luo Ji","Feixiang Guo","Teng Chen","Qingqing Gu","Xiaoyu Wang","Ningyuan Xi","Yihong Wang","Peng Yu","Yue Zhao","Hongyang Lei","Zhonglin Jiang","Yong Chen"],"pdf_url":"https://arxiv.org/pdf/2412.16615v1.pdf","comment":"11 pages, 3 figures, accepted by ECIR 2025"},{"id":"http://arxiv.org/abs/2412.16589v1","updated":"2024-12-21T11:30:54Z","published":"2024-12-21T11:30:54Z","title":"Improving FIM Code Completions via Context & Curriculum Based Learning","summary":" Fill-in-the-Middle (FIM) models play a vital role in code completion tasks,\nleveraging both prefix and suffix context to provide more accurate and\ncontextually relevant suggestions. This paper presents approaches to improve\nFIM code completion while addressing the challenge of maintaining low latency\nfor real-time coding assistance. We enhance FIM code completion by\nincorporating context and curriculum examples in the training process. We\nidentify patterns where completion suggestions fail more frequently, revealing\ncomplexities that smaller language models struggle with. To address these\nchallenges, we develop a curriculum dataset by extracting hard-to-complete\npatterns from code repositories and generate context examples using semantic\nand static analysis tools (e.g. TSC compiler). We fine-tune various sized\nmodels, including StarCoder and DeepSeek, on this enhanced dataset. 
Our\nevaluation encompasses three key dimensions: the Santa Coder FIM task, the\nAmazon CCEval benchmark, and a new Multi-Line Infilling evaluation benchmark\nderived from SWE-bench. Comprehensive ablation studies across multiple model\nsizes reveal that while all fine-tuned models show improvements, the\nperformance gains are more pronounced for smaller parameter models, and that\nincorporating difficult-to-complete examples, as part of curriculum learning,\nimproves the code completion performance. This finding is particularly\nsignificant given the latency constraints of code completion tasks. While\nlarger models like GPT and Claude perform well in multi-line completions, they\nare prohibitively challenging to use given their high latency, whereas our fine-tuned\nmodels achieve a balance between performance and latency. Finally, we validate\nour approach through online A/B testing, demonstrating tangible improvements in\nCompletion Acceptance Rate (CAR) and Completion Persistence Rate (CPR), with\nzero latency impact.\n","authors":["Hitesh Sagtani","Rishabh Mehrotra","Beyang Liu"],"pdf_url":"https://arxiv.org/pdf/2412.16589v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2409.12730v2","updated":"2024-12-21T10:29:30Z","published":"2024-09-19T12:55:34Z","title":"When SparseMoE Meets Noisy Interactions: An Ensemble View on Denoising\n Recommendation","summary":" Learning user preferences from implicit feedback is one of the core\nchallenges in recommendation. The difficulty lies in the potential noise within\nimplicit feedback. Therefore, various denoising recommendation methods have\nbeen proposed recently. However, most of them overly rely on the hyperparameter\nconfigurations, inevitably leading to inadequacies in model adaptability and\ngeneralization performance. 
In this study, we propose a novel Adaptive Ensemble\nLearning (AEL) for denoising recommendation, which employs a sparse gating\nnetwork as a brain, selecting suitable experts to synthesize appropriate\ndenoising capacities for different data samples. To address the ensemble\nlearning shortcoming of model complexity and ensure sub-recommender diversity,\nwe also propose a novel method that stacks components to create\nsub-recommenders instead of directly constructing them. Extensive experiments\nacross various datasets demonstrate that AEL outperforms other methods on a variety of\npopular metrics, even in the presence of substantial and dynamic noise. Our\ncode is available at https://github.com/cpu9xx/AEL.\n","authors":["Weipu Chen","Zhuangzhuang He","Fei Liu"],"pdf_url":"https://arxiv.org/pdf/2409.12730v2.pdf","comment":"Accepted at ICASSP 2025. 5 pages, 4 figures"},{"id":"http://arxiv.org/abs/2412.08300v3","updated":"2024-12-21T09:17:08Z","published":"2024-12-11T11:29:15Z","title":"Augmenting Sequential Recommendation with Balanced Relevance and\n Diversity","summary":" By generating new yet effective data, data augmentation has become a\npromising method to mitigate the data sparsity problem in sequential\nrecommendation. Existing works focus on augmenting the original data but rarely\nexplore the issue of imbalanced relevance and diversity for augmented data,\nleading to semantic drift problems or limited performance improvements. In this\npaper, we propose a novel Balanced data Augmentation Plugin for Sequential\nRecommendation (BASRec) to generate data that balance relevance and diversity.\nBASRec consists of two modules: Single-sequence Augmentation and Cross-sequence\nAugmentation. 
The former leverages the randomness of the heuristic operators to\ngenerate diverse sequences for a single user, after which the diverse and the\noriginal sequences are fused at the representation level to obtain relevance.\nFurther, we devise a reweighting strategy to enable the model to learn the\npreferences based on the two properties adaptively. The Cross-sequence\nAugmentation performs nonlinear mixing between different sequence\nrepresentations from two directions. It produces virtual sequence\nrepresentations that are diverse enough but retain the vital semantics of the\noriginal sequences. These two modules enhance the model to discover\nfine-grained preferences knowledge from single-user and cross-user\nperspectives. Extensive experiments verify the effectiveness of BASRec. The\naverage improvement is up to 72.0% on GRU4Rec, 33.8% on SASRec, and 68.5% on\nFMLP-Rec. We demonstrate that BASRec generates data with a better balance\nbetween relevance and diversity than existing methods. The source code is\navailable at https://github.com/KingGugu/BASRec.\n","authors":["Yizhou Dang","Jiahui Zhang","Yuting Liu","Enneng Yang","Yuliang Liang","Guibing Guo","Jianzhe Zhao","Xingwei Wang"],"pdf_url":"https://arxiv.org/pdf/2412.08300v3.pdf","comment":"Accepted by AAAI 2025"},{"id":"http://arxiv.org/abs/2412.16502v1","updated":"2024-12-21T06:21:45Z","published":"2024-12-21T06:21:45Z","title":"STKDRec: Spatial-Temporal Knowledge Distillation for Takeaway\n Recommendation","summary":" The takeaway recommendation system is designed to recommend users' future\ntakeaway purchases based on their historical purchase behaviors, thereby\nimproving user satisfaction and increasing merchant sales. Existing methods\nfocus on incorporating auxiliary information or leveraging knowledge graphs to\nalleviate the sparsity issue of user purchase sequence data. 
However, two main\nchallenges limit the performance of these approaches: (1) how to capture\ndynamic user preferences on complex geospatial information and (2) how to\nefficiently integrate spatial-temporal knowledge from graphs and sequence data\nwith low calculation costs. In this paper, we propose a novel spatial-temporal\nknowledge distillation model for takeaway recommendation (STKDRec) based on a\ntwo-stage training process. Specifically, during the first pre-training stage,\na spatial-temporal knowledge graph (STKG) encoder is pre-trained to extract the\nhigh-order spatial-temporal and collaborative associations within the STKG.\nDuring the second STKD stage, a spatial-temporal Transformer is employed to\ncomprehensively model dynamic user preferences on various types of fine-grained\ngeospatial information from a sequence perspective. Furthermore, the STKD\nstrategy is introduced to adaptively fuse the rich spatial-temporal knowledge\nfrom the pre-trained STKG encoder and the spatial-temporal transformer while\nreducing the cost of model training. Extensive experiments on three real-world\ndatasets show that our STKDRec significantly outperforms the state-of-the-art\nbaselines. Our code is available at: https://github.com/Zhaoshuyuan0246/STKDRec.\n","authors":["Shuyuan Zhao","Wei Chen","Boyan Shi","Liyong Zhou","Shuohao Lin","Huaiyu Wan"],"pdf_url":"https://arxiv.org/pdf/2412.16502v1.pdf","comment":"AAAI 2025"},{"id":"http://arxiv.org/abs/2409.19925v2","updated":"2024-12-21T06:09:16Z","published":"2024-09-30T03:59:06Z","title":"LLMEmb: Large Language Model Can Be a Good Embedding Generator for\n Sequential Recommendation","summary":" Sequential Recommender Systems (SRS), which model a user's interaction\nhistory to predict the next item of interest, are widely used in various\napplications. However, existing SRS often struggle with low-popularity items, a\nchallenge known as the long-tail problem. 
This issue leads to reduced\nserendipity for users and diminished profits for sellers, ultimately harming\nthe overall system. Large Language Model (LLM) has the ability to capture\nsemantic relationships between items, independent of their popularity, making\nit a promising solution to this problem. In this paper, we introduce LLMEmb, a\nnovel method leveraging LLM to generate item embeddings that enhance SRS\nperformance. To bridge the gap between general-purpose LLM and the\nrecommendation domain, we propose a Supervised Contrastive Fine-Tuning (SCFT)\napproach. This approach includes attribute-level data augmentation and a\ntailored contrastive loss to make LLM more recommendation-friendly.\nAdditionally, we emphasize the importance of integrating collaborative signals\ninto LLM-generated embeddings, for which we propose Recommendation Adaptation\nTraining (RAT). This further refines the embeddings for optimal use in SRS. The\nLLMEmb-derived embeddings can be seamlessly integrated with any SRS models,\nunderscoring the practical value. Comprehensive experiments conducted on three\nreal-world datasets demonstrate that LLMEmb significantly outperforms existing\nmethods across multiple SRS models. 
The code for our method is released online\nhttps://github.com/Applied-Machine-Learning-Lab/LLMEmb.\n","authors":["Qidong Liu","Xian Wu","Wanyu Wang","Yejing Wang","Yuanshao Zhu","Xiangyu Zhao","Feng Tian","Yefeng Zheng"],"pdf_url":"https://arxiv.org/pdf/2409.19925v2.pdf","comment":"accepted by AAAI'25"},{"id":"http://arxiv.org/abs/2410.01154v2","updated":"2024-12-21T02:58:42Z","published":"2024-10-02T01:12:54Z","title":"Unleashing the Power of Large Language Models in Zero-shot Relation\n Extraction via Self-Prompting","summary":" Recent research in zero-shot Relation Extraction (RE) has focused on using\nLarge Language Models (LLMs) due to their impressive zero-shot capabilities.\nHowever, current methods often perform suboptimally, mainly due to a lack of\ndetailed, context-specific prompts needed for understanding various sentences\nand relations. To address this, we introduce the Self-Prompting framework, a\nnovel method designed to fully harness the embedded RE knowledge within LLMs.\nSpecifically, our framework employs a three-stage diversity approach to prompt\nLLMs, generating multiple synthetic samples that encapsulate specific relations\nfrom scratch. These generated samples act as in-context learning samples,\noffering explicit and context-specific guidance to efficiently prompt LLMs for\nRE. Experimental evaluations on benchmark datasets show our approach\noutperforms existing LLM-based zero-shot RE methods. 
Additionally, our\nexperiments confirm the effectiveness of our generation pipeline in producing\nhigh-quality synthetic data that enhances performance.\n","authors":["Siyi Liu","Yang Li","Jiang Li","Shan Yang","Yunshi Lan"],"pdf_url":"https://arxiv.org/pdf/2410.01154v2.pdf","comment":"EMNLP 2024 Short"},{"id":"http://arxiv.org/abs/2412.08950v2","updated":"2024-12-21T02:52:45Z","published":"2024-12-12T05:28:34Z","title":"Predicting Quality of Video Gaming Experience Using Global-Scale\n Telemetry Data and Federated Learning","summary":" Frames Per Second (FPS) significantly affects the gaming experience.\nProviding players with accurate FPS estimates prior to purchase benefits both\nplayers and game developers. However, we have a limited understanding of how to\npredict a game's FPS performance on a specific device. In this paper, we first\nconduct a comprehensive analysis of a wide range of factors that may affect\ngame FPS on a global-scale dataset to identify the determinants of FPS. This\nincludes player-side and game-side characteristics, as well as country-level\nsocio-economic statistics. Furthermore, recognizing that accurate FPS\npredictions require extensive user data, which raises privacy concerns, we\npropose a federated learning-based model to ensure user privacy. Each player\nand game is assigned a unique learnable knowledge kernel that gradually\nextracts latent features for improved accuracy. We also introduce a novel\ntraining and prediction scheme that allows these kernels to be dynamically\nplug-and-play, effectively addressing cold start issues. To train this model\nwith minimal bias, we collected a large telemetry dataset from 224 countries\nand regions, 100,000 users, and 835 games. 
Our model achieved a mean\nWasserstein distance of 0.469 between predicted and ground truth FPS\ndistributions, outperforming all baseline methods.\n","authors":["Zhongyang Zhang","Jinhe Wen","Zixi Chen","Dara Arbab","Sruti Sahani","William Lewis","Kent Giard","Bijan Arbab","Haojian Jin","Tauhidur Rahman"],"pdf_url":"https://arxiv.org/pdf/2412.08950v2.pdf","comment":"22 pages, 11 figures, 6 tables"},{"id":"http://arxiv.org/abs/2412.16435v1","updated":"2024-12-21T01:52:03Z","published":"2024-12-21T01:52:03Z","title":"THeGCN: Temporal Heterophilic Graph Convolutional Network","summary":" Graph Neural Networks (GNNs) have exhibited remarkable efficacy in diverse\ngraph learning tasks, particularly on static homophilic graphs. Recent\nattention has pivoted towards more intricate structures, encompassing (1)\nstatic heterophilic graphs encountering the edge heterophily issue in the\nspatial domain and (2) event-based continuous graphs in the temporal domain.\nState-of-the-art (SOTA) has been concurrently addressing these two lines of\nwork but tends to overlook the presence of heterophily in the temporal domain,\nconstituting the temporal heterophily issue. Furthermore, we highlight that the\nedge heterophily issue and the temporal heterophily issue often co-exist in\nevent-based continuous graphs, giving rise to the temporal edge heterophily\nchallenge. To tackle this challenge, this paper first introduces the temporal\nedge heterophily measurement. Subsequently, we propose the Temporal\nHeterophilic Graph Convolutional Network (THeGCN), an innovative model that\nincorporates the low/high-pass graph signal filtering technique to accurately\ncapture both edge (spatial) heterophily and temporal heterophily. Specifically,\nthe THeGCN model consists of two key components: a sampler and an aggregator.\nThe sampler selects events relevant to a node at a given moment. 
Then, the\naggregator executes message-passing, encoding temporal information, node\nattributes, and edge attributes into node embeddings. Extensive experiments\nconducted on 5 real-world datasets validate the efficacy of THeGCN.\n","authors":["Yuchen Yan","Yuzhong Chen","Huiyuan Chen","Xiaoting Li","Zhe Xu","Zhichen Zeng","Zhining Liu","Hanghang Tong"],"pdf_url":"https://arxiv.org/pdf/2412.16435v1.pdf","comment":null}],"Multimedia":[{"id":"http://arxiv.org/abs/2412.16530v1","updated":"2024-12-21T08:15:52Z","published":"2024-12-21T08:15:52Z","title":"Improving Lip-synchrony in Direct Audio-Visual Speech-to-Speech\n Translation","summary":" Audio-Visual Speech-to-Speech Translation typically prioritizes improving\ntranslation quality and naturalness. However, an equally critical aspect in\naudio-visual content is lip-synchrony-ensuring that the movements of the lips\nmatch the spoken content-essential for maintaining realism in dubbed videos.\nDespite its importance, the inclusion of lip-synchrony constraints in AVS2S\nmodels has been largely overlooked. This study addresses this gap by\nintegrating a lip-synchrony loss into the training process of AVS2S models. Our\nproposed method significantly enhances lip-synchrony in direct audio-visual\nspeech-to-speech translation, achieving an average LSE-D score of 10.67,\nrepresenting a 9.2% reduction in LSE-D over a strong baseline across four\nlanguage pairs. 
Additionally, it maintains the naturalness and high quality of\nthe translated speech when overlaid onto the original video, without any\ndegradation in translation quality.\n","authors":["Lucas Goncalves","Prashant Mathur","Xing Niu","Brady Houston","Chandrashekhar Lavania","Srikanth Vishnubhotla","Lijia Sun","Anthony Ferritto"],"pdf_url":"https://arxiv.org/pdf/2412.16530v1.pdf","comment":"Accepted at ICASSP, 4 pages"},{"id":"http://arxiv.org/abs/2412.16495v1","updated":"2024-12-21T05:49:40Z","published":"2024-12-21T05:49:40Z","title":"Follow-Your-MultiPose: Tuning-Free Multi-Character Text-to-Video\n Generation via Pose Guidance","summary":" Text-editable and pose-controllable character video generation is a\nchallenging but prevailing topic with practical applications. However, existing\napproaches mainly focus on single-object video generation with pose guidance,\nignoring the realistic situation that multi-character appear concurrently in a\nscenario. To tackle this, we propose a novel multi-character video generation\nframework in a tuning-free manner, which is based on the separated text and\npose guidance. Specifically, we first extract character masks from the pose\nsequence to identify the spatial position for each generating character, and\nthen single prompts for each character are obtained with LLMs for precise text\nguidance. Moreover, the spatial-aligned cross attention and multi-branch\ncontrol module are proposed to generate fine grained controllable\nmulti-character video. The visualized results of generating video demonstrate\nthe precise controllability of our method for multi-character generation. We\nalso verify the generality of our method by applying it to various personalized\nT2I models. 
Moreover, the quantitative results show that our approach achieves\nsuperior performance compared with previous works.\n","authors":["Beiyuan Zhang","Yue Ma","Chunlei Fu","Xinyang Song","Zhenan Sun","Ziqiang Li"],"pdf_url":"https://arxiv.org/pdf/2412.16495v1.pdf","comment":"5 pages,conference"},{"id":"http://arxiv.org/abs/2408.15461v3","updated":"2024-12-21T01:42:38Z","published":"2024-08-28T00:54:51Z","title":"Hand1000: Generating Realistic Hands from Text with Only 1,000 Images","summary":" Text-to-image generation models have achieved remarkable advancements in\nrecent years, aiming to produce realistic images from textual descriptions.\nHowever, these models often struggle with generating anatomically accurate\nrepresentations of human hands. The resulting images frequently exhibit issues\nsuch as incorrect numbers of fingers, unnatural twisting or interlacing of\nfingers, or blurred and indistinct hands. These issues stem from the inherent\ncomplexity of hand structures and the difficulty in aligning textual\ndescriptions with precise visual depictions of hands. To address these\nchallenges, we propose a novel approach named Hand1000 that enables the\ngeneration of realistic hand images with target gesture using only 1,000\ntraining samples. The training of Hand1000 is divided into three stages with\nthe first stage aiming to enhance the model's understanding of hand anatomy by\nusing a pre-trained hand gesture recognition model to extract gesture\nrepresentation. The second stage further optimizes text embedding by\nincorporating the extracted hand gesture representation, to improve alignment\nbetween the textual descriptions and the generated hand images. The third stage\nutilizes the optimized embedding to fine-tune the Stable Diffusion model to\ngenerate realistic hand images. 
In addition, we construct the first publicly\navailable dataset specifically designed for text-to-hand image generation.\nBased on the existing hand gesture recognition dataset, we adopt advanced image\ncaptioning models and LLaMA3 to generate high-quality textual descriptions\nenriched with detailed gesture information. Extensive experiments demonstrate\nthat Hand1000 significantly outperforms existing models in producing\nanatomically correct hand images while faithfully representing other details in\nthe text, such as faces, clothing, and colors.\n","authors":["Haozhuo Zhang","Bin Zhu","Yu Cao","Yanbin Hao"],"pdf_url":"https://arxiv.org/pdf/2408.15461v3.pdf","comment":"Accepted by AAAI 2025. Project page\n https://haozhuo-zhang.github.io/Hand1000-project-page/"}]},"2024-12-20T00:00:00Z":{"Information Retrieval":[{"id":"http://arxiv.org/abs/2411.12921v2","updated":"2024-12-20T23:02:35Z","published":"2024-11-19T23:19:46Z","title":"A Comparative Study of Text Retrieval Models on DaReCzech","summary":" This article presents a comprehensive evaluation of 7 off-the-shelf document\nretrieval models: Splade, Plaid, Plaid-X, SimCSE, Contriever, OpenAI ADA and\nGemma2 chosen to determine their performance on the Czech retrieval dataset\nDaReCzech. The primary objective of our experiments is to estimate the quality\nof modern retrieval approaches in the Czech language. Our analyses include\nretrieval quality, speed, and memory footprint. Secondly, we analyze whether it\nis better to use the model directly on Czech text, or to use machine\ntranslation into English, followed by retrieval in English. Our experiments\nidentify the most effective option for Czech information retrieval. The\nfindings revealed notable performance differences among the models, with\nGemma2 achieving the highest precision and recall, while Contriever performed\npoorly. 
Conclusively, SPLADE and PLAID models offered a balance of efficiency\nand performance.\n","authors":["Jakub Stetina","Martin Fajcik","Michal Stefanik","Michal Hradis"],"pdf_url":"https://arxiv.org/pdf/2411.12921v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.16382v1","updated":"2024-12-20T22:36:19Z","published":"2024-12-20T22:36:19Z","title":"EMPRA: Embedding Perturbation Rank Attack against Neural Ranking Models","summary":" Recent research has shown that neural information retrieval techniques may be\nsusceptible to adversarial attacks. Adversarial attacks seek to manipulate the\nranking of documents, with the intention of exposing users to targeted content.\nIn this paper, we introduce the Embedding Perturbation Rank Attack (EMPRA)\nmethod, a novel approach designed to perform adversarial attacks on black-box\nNeural Ranking Models (NRMs). EMPRA manipulates sentence-level embeddings,\nguiding them towards pertinent context related to the query while preserving\nsemantic integrity. This process generates adversarial texts that seamlessly\nintegrate with the original content and remain imperceptible to humans. Our\nextensive evaluation conducted on the widely-used MS MARCO V1 passage\ncollection demonstrate the effectiveness of EMPRA against a wide range of\nstate-of-the-art baselines in promoting a specific set of target documents\nwithin a given ranked results. Specifically, EMPRA successfully achieves a\nre-ranking of almost 96% of target documents originally ranked between 51-100\nto rank within the top 10. Furthermore, EMPRA does not depend on surrogate\nmodels for adversarial text generation, enhancing its robustness against\ndifferent NRMs in realistic settings.\n","authors":["Amin Bigdeli","Negar Arabzadeh","Ebrahim Bagheri","Charles L. A. 
Clarke"],"pdf_url":"https://arxiv.org/pdf/2412.16382v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.14146v2","updated":"2024-12-20T20:25:03Z","published":"2024-12-18T18:44:08Z","title":"Advanced Reasoning and Transformation Engine for Multi-Step Insight\n Synthesis in Data Analytics with Large Language Models","summary":" This paper presents the Advanced Reasoning and Transformation Engine for\nMulti-Step Insight Synthesis in Data Analytics (ARTEMIS-DA), a novel framework\ndesigned to augment Large Language Models (LLMs) for solving complex,\nmulti-step data analytics tasks. ARTEMIS-DA integrates three core components:\nthe Planner, which dissects complex user queries into structured, sequential\ninstructions encompassing data preprocessing, transformation, predictive\nmodeling, and visualization; the Coder, which dynamically generates and\nexecutes Python code to implement these instructions; and the Grapher, which\ninterprets generated visualizations to derive actionable insights. By\norchestrating the collaboration between these components, ARTEMIS-DA\neffectively manages sophisticated analytical workflows involving advanced\nreasoning, multi-step transformations, and synthesis across diverse data\nmodalities. The framework achieves state-of-the-art (SOTA) performance on\nbenchmarks such as WikiTableQuestions and TabFact, demonstrating its ability to\ntackle intricate analytical tasks with precision and adaptability. 
By combining\nthe reasoning capabilities of LLMs with automated code generation and execution\nand visual analysis, ARTEMIS-DA offers a robust, scalable solution for\nmulti-step insight synthesis, addressing a wide range of challenges in data\nanalytics.\n","authors":["Atin Sakkeer Hussain"],"pdf_url":"https://arxiv.org/pdf/2412.14146v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.16311v1","updated":"2024-12-20T19:49:12Z","published":"2024-12-20T19:49:12Z","title":"HybGRAG: Hybrid Retrieval-Augmented Generation on Textual and Relational\n Knowledge Bases","summary":" Given a semi-structured knowledge base (SKB), where text documents are\ninterconnected by relations, how can we effectively retrieve relevant\ninformation to answer user questions? Retrieval-Augmented Generation (RAG)\nretrieves documents to assist large language models (LLMs) in question\nanswering; while Graph RAG (GRAG) uses structured knowledge bases as its\nknowledge source. However, many questions require both textual and relational\ninformation from SKB - referred to as \"hybrid\" questions - which complicates\nthe retrieval process and underscores the need for a hybrid retrieval method\nthat leverages both information. In this paper, through our empirical analysis,\nwe identify key insights that show why existing methods may struggle with\nhybrid question answering (HQA) over SKB. Based on these insights, we propose\nHybGRAG for HQA consisting of a retriever bank and a critic module, with the\nfollowing advantages: (1) Agentic, it automatically refines the output by\nincorporating feedback from the critic module, (2) Adaptive, it solves hybrid\nquestions requiring both textual and relational information with the retriever\nbank, (3) Interpretable, it justifies decision making with intuitive refinement\npath, and (4) Effective, it surpasses all baselines on HQA benchmarks. 
In\nexperiments on the STaRK benchmark, HybGRAG achieves significant performance\ngains, with an average relative improvement in Hit@1 of 51%.\n","authors":["Meng-Chieh Lee","Qi Zhu","Costas Mavromatis","Zhen Han","Soji Adeshina","Vassilis N. Ioannidis","Huzefa Rangwala","Christos Faloutsos"],"pdf_url":"https://arxiv.org/pdf/2412.16311v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.16086v1","updated":"2024-12-20T17:33:50Z","published":"2024-12-20T17:33:50Z","title":"Towards Interpretable Radiology Report Generation via Concept\n Bottlenecks using a Multi-Agentic RAG","summary":" Deep learning has advanced medical image classification, but interpretability\nchallenges hinder its clinical adoption. This study enhances interpretability\nin Chest X-ray (CXR) classification by using concept bottleneck models (CBMs)\nand a multi-agent Retrieval-Augmented Generation (RAG) system for report\ngeneration. By modeling relationships between visual features and clinical\nconcepts, we create interpretable concept vectors that guide a multi-agent RAG\nsystem to generate radiology reports, enhancing clinical relevance,\nexplainability, and transparency. Evaluation of the generated reports using an\nLLM-as-a-judge confirmed the interpretability and clinical utility of our\nmodel's outputs. On the COVID-QU dataset, our model achieved 81% classification\naccuracy and demonstrated robust report generation performance, with five key\nmetrics ranging between 84% and 90%. 
This interpretable multi-agent framework\nbridges the gap between high-performance AI and the explainability required for\nreliable AI-driven CXR analysis in clinical settings.\n","authors":["Hasan Md Tusfiqur Alam","Devansh Srivastav","Md Abdul Kadir","Daniel Sonntag"],"pdf_url":"https://arxiv.org/pdf/2412.16086v1.pdf","comment":"Accepted in ECIR 2025"},{"id":"http://arxiv.org/abs/2412.15973v1","updated":"2024-12-20T15:18:02Z","published":"2024-12-20T15:18:02Z","title":"Legommenders: A Comprehensive Content-Based Recommendation Library with\n LLM Support","summary":" We present Legommenders, a unique library designed for content-based\nrecommendation that enables the joint training of content encoders alongside\nbehavior and interaction modules, thereby facilitating the seamless integration\nof content understanding directly into the recommendation pipeline.\nLegommenders allows researchers to effortlessly create and analyze over 1,000\ndistinct models across 15 diverse datasets. Further, it supports the\nincorporation of contemporary large language models, both as feature encoder\nand data generator, offering a robust platform for developing state-of-the-art\nrecommendation models and enabling more personalized and effective content\ndelivery.\n","authors":["Qijiong Liu","Lu Fan","Xiao-Ming Wu"],"pdf_url":"https://arxiv.org/pdf/2412.15973v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.15957v1","updated":"2024-12-20T14:51:12Z","published":"2024-12-20T14:51:12Z","title":"From General to Specific: Tailoring Large Language Models for\n Personalized Healthcare","summary":" The rapid development of large language models (LLMs) has transformed many\nindustries, including healthcare. However, previous medical LLMs have largely\nfocused on leveraging general medical knowledge to provide responses, without\naccounting for patient variability and lacking true personalization at the\nindividual level. 
To address this, we propose a novel method called\npersonalized medical language model (PMLM), which explores and optimizes\npersonalized LLMs through recommendation systems and reinforcement learning\n(RL). Specifically, by utilizing self-informed and peer-informed\npersonalization, PMLM captures changes in behaviors and preferences to design\ninitial personalized prompts tailored to individual needs. We further refine\nthese initial personalized prompts through RL, ultimately enhancing the\nprecision of LLM guidance. Notably, the personalized prompts are hard prompts,\nwhich grants PMLM high adaptability and reusability, allowing it to directly\nleverage high-quality proprietary LLMs. We evaluate PMLM using real-world\nobstetrics and gynecology data, and the experimental results demonstrate that\nPMLM achieves personalized responses, and it provides more refined and\nindividualized services, offering a potential way for personalized medical\nLLMs.\n","authors":["Ruize Shi","Hong Huang","Wei Zhou","Kehan Yin","Kai Zhao","Yun Zhao"],"pdf_url":"https://arxiv.org/pdf/2412.15957v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.16266v1","updated":"2024-12-20T10:30:06Z","published":"2024-12-20T10:30:06Z","title":"Learned Compression of Nonlinear Time Series With Random Access","summary":" Time series play a crucial role in many fields, including finance,\nhealthcare, industry, and environmental monitoring. The storage and retrieval\nof time series can be challenging due to their unstoppable growth. In fact,\nthese applications often sacrifice precious historical data to make room for\nnew data.\n General-purpose compressors can mitigate this problem with their good\ncompression ratios, but they lack efficient random access on compressed data,\nthus preventing real-time analyses. Ad-hoc streaming solutions, instead,\ntypically optimise only for compression and decompression speed, while giving\nup compression effectiveness and random access functionality. 
Furthermore, all\nthese methods lack awareness of certain special regularities of time series,\nwhose trends over time can often be described by some linear and nonlinear\nfunctions.\n To address these issues, we introduce NeaTS, a randomly-accessible\ncompression scheme that approximates the time series with a sequence of\nnonlinear functions of different kinds and shapes, carefully selected and\nplaced by a partitioning algorithm to minimise the space. The approximation\nresiduals are bounded, which allows storing them in little space and thus\nrecovering the original data losslessly, or simply discarding them to obtain a\nlossy time series representation with maximum error guarantees.\n Our experiments show that NeaTS improves the compression ratio of the\nstate-of-the-art lossy compressors that use linear or nonlinear functions (or\nboth) by up to 14%. Compared to lossless compressors, NeaTS emerges as the only\napproach to date providing, simultaneously, compression ratios close to or\nbetter than the best existing compressors, a much faster decompression speed,\nand orders of magnitude more efficient random access, thus enabling the storage\nand real-time analysis of massive and ever-growing amounts of (historical) time\nseries data.\n","authors":["Andrea Guerra","Giorgio Vinciguerra","Antonio Boffa","Paolo Ferragina"],"pdf_url":"https://arxiv.org/pdf/2412.16266v1.pdf","comment":"Accepted for publication in Proceedings of the 41st IEEE\n International Conference on Data Engineering (ICDE 2025)"},{"id":"http://arxiv.org/abs/2412.15759v1","updated":"2024-12-20T10:25:28Z","published":"2024-12-20T10:25:28Z","title":"ASPIRE: Assistive System for Performance Evaluation in IR","summary":" Information Retrieval (IR) evaluation involves far more complexity than\nmerely presenting performance measures in a table. 
Researchers often need to\ncompare multiple models across various dimensions, such as the Precision-Recall\ntrade-off and response time, to understand the reasons behind the varying\nperformance of specific queries for different models. We introduce ASPIRE\n(Assistive System for Performance Evaluation in IR), a visual analytics tool\ndesigned to address these complexities by providing an extensive and\nuser-friendly interface for in-depth analysis of IR experiments. ASPIRE\nsupports four key aspects of IR experiment evaluation and analysis:\nsingle/multi-experiment comparisons, query-level analysis, query\ncharacteristics-performance interplay, and collection-based retrieval analysis.\nWe showcase the functionality of ASPIRE using the TREC Clinical Trials\ncollection. ASPIRE is an open-source toolkit available online:\nhttps://github.com/GiorgosPeikos/ASPIRE\n","authors":["Georgios Peikos","Wojciech Kusa","Symeon Symeonidis"],"pdf_url":"https://arxiv.org/pdf/2412.15759v1.pdf","comment":"Accepted as a demo paper at the 47th European Conference on\n Information Retrieval (ECIR)"},{"id":"http://arxiv.org/abs/2412.14302v2","updated":"2024-12-20T07:56:22Z","published":"2024-12-18T20:10:42Z","title":"SAFERec: Self-Attention and Frequency Enriched Model for Next Basket\n Recommendation","summary":" Transformer-based approaches such as BERT4Rec and SASRec demonstrate strong\nperformance in Next Item Recommendation (NIR) tasks. However, applying these\narchitectures to Next-Basket Recommendation (NBR) tasks, which often involve\nhighly repetitive interactions, is challenging due to the vast number of\npossible item combinations in a basket. Moreover, frequency-based methods such\nas TIFU-KNN and UP-CF still demonstrate strong performance in NBR tasks,\nfrequently outperforming deep-learning approaches. 
This paper introduces\nSAFERec, a novel algorithm for NBR that enhances transformer-based\narchitectures from NIR by incorporating item frequency information,\nconsequently improving their applicability to NBR tasks. Extensive experiments\non multiple datasets show that SAFERec outperforms all other baselines,\nspecifically achieving an 8\\% improvement in Recall@10.\n","authors":["Oleg Lashinin","Denis Krasilnikov","Aleksandr Milogradskii","Marina Ananyeva"],"pdf_url":"https://arxiv.org/pdf/2412.14302v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.15602v1","updated":"2024-12-20T06:50:31Z","published":"2024-12-20T06:50:31Z","title":"Music Genre Classification: Ensemble Learning with Subcomponents-level\n Attention","summary":" Music Genre Classification is one of the most popular topics in the fields of\nMusic Information Retrieval (MIR) and digital signal processing. Deep Learning\nhas emerged as the top performer for classifying music genres among various\nmethods. The letter introduces a novel approach by combining ensemble learning\nwith attention to sub-components, aiming to enhance the accuracy of identifying\nmusic genres. The core innovation of our work is the proposal to classify the\nsubcomponents of the music pieces separately, allowing our model to capture\ndistinct characteristics from those sub components. By applying ensemble\nlearning techniques to these individual classifications, we make the final\nclassification decision on the genre of the music. 
The proposed method has\nsuperior advantages in terms of accuracy compared to the other state-of-the-art\ntechniques trained and tested on the GTZAN dataset.\n","authors":["Yichen Liu","Abhijit Dasgupta","Qiwei He"],"pdf_url":"https://arxiv.org/pdf/2412.15602v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.13102v3","updated":"2024-12-20T05:42:38Z","published":"2024-12-17T17:15:21Z","title":"AIR-Bench: Automated Heterogeneous Information Retrieval Benchmark","summary":" Evaluation plays a crucial role in the advancement of information retrieval\n(IR) models. However, current benchmarks, which are based on predefined domains\nand human-labeled data, face limitations in addressing evaluation needs for\nemerging domains both cost-effectively and efficiently. To address this\nchallenge, we propose the Automated Heterogeneous Information Retrieval\nBenchmark (AIR-Bench). AIR-Bench is distinguished by three key features: 1)\nAutomated. The testing data in AIR-Bench is automatically generated by large\nlanguage models (LLMs) without human intervention. 2) Heterogeneous. The\ntesting data in AIR-Bench is generated with respect to diverse tasks, domains\nand languages. 3) Dynamic. The domains and languages covered by AIR-Bench are\nconstantly augmented to provide an increasingly comprehensive evaluation\nbenchmark for community developers. We develop a reliable and robust data\ngeneration pipeline to automatically create diverse and high-quality evaluation\ndatasets based on real-world corpora. Our findings demonstrate that the\ngenerated testing data in AIR-Bench aligns well with human-labeled testing\ndata, making AIR-Bench a dependable benchmark for evaluating IR models. 
The\nresources in AIR-Bench are publicly available at\nhttps://github.com/AIR-Bench/AIR-Bench.\n","authors":["Jianlyu Chen","Nan Wang","Chaofan Li","Bo Wang","Shitao Xiao","Han Xiao","Hao Liao","Defu Lian","Zheng Liu"],"pdf_url":"https://arxiv.org/pdf/2412.13102v3.pdf","comment":"31 pages, 6 figures; Update Table 4 and Figure 3"},{"id":"http://arxiv.org/abs/2412.10674v2","updated":"2024-12-20T05:38:27Z","published":"2024-12-14T04:22:09Z","title":"USM: Unbiased Survey Modeling for Limiting Negative User Experiences in\n Recommendation Systems","summary":" Negative feedback signals are crucial to guardrail content recommendations\nand improve user experience. When these signals are effectively integrated into\nrecommendation systems, they play a vital role in preventing the promotion of\nharmful or undesirable content, thereby contributing to a healthier online\nenvironment. However, the challenges associated with negative signals are\nnoteworthy. Due to the limited visibility of options for users to express\nnegative feedback, these signals are often sparse compared to positive signals.\nThis imbalance can lead to a skewed understanding of user preferences,\nresulting in recommendations that prioritize short-term engagement over\nlong-term satisfaction. Moreover, an over-reliance on positive signals can\ncreate a filter bubble, where users are continuously exposed to content that\naligns with their immediate preferences but may not be beneficial in the long\nrun. This scenario can ultimately lead to user attrition as audiences become\ndisillusioned with the quality of the content provided. Additionally, existing\nuser signals frequently fail to meet specific customized requirements, such as\nunderstanding the underlying reasons for a user's likes or dislikes regarding a\nvideo. 
This lack of granularity hinders our ability to tailor content\nrecommendations effectively, as we cannot identify the particular attributes of\ncontent that resonate with individual users.\n","authors":["Chenghui Yu","Peiyi Li","Haoze Wu","Bingfeng Deng","Hongyu Xiong"],"pdf_url":"https://arxiv.org/pdf/2412.10674v2.pdf","comment":"9 pages, 6 figures"},{"id":"http://arxiv.org/abs/2405.07803v3","updated":"2024-12-20T04:14:48Z","published":"2024-05-13T14:45:08Z","title":"Non-Random Data Encodes its Geometric and Topological Dimensions","summary":" Based on the principles of information theory, measure theory, and\ntheoretical computer science, we introduce a signal deconvolution method with a\nwide range of applications to coding theory, particularly in zero-knowledge\none-way communication channels, such as in deciphering messages (i.e., objects\nembedded into multidimensional spaces) from unknown generating sources about\nwhich no prior knowledge is available and to which no return message can be\nsent. Our multidimensional space reconstruction method from an arbitrary\nreceived signal is proven to be agnostic vis-\\`a-vis the encoding-decoding\nscheme, computation model, programming language, formal theory, the computable\n(or semi-computable) method of approximation to algorithmic complexity, and any\narbitrarily chosen (computable) probability measure. The method derives from\nthe principles of an approach to Artificial General Intelligence (AGI) capable\nof building a general-purpose model of models independent of any arbitrarily\nassumed prior probability distribution. We argue that this optimal and\nuniversal method of decoding non-random data has applications to signal\nprocessing, causal deconvolution, topological and geometric properties\nencoding, cryptography, and bio- and technosignature detection.\n","authors":["Hector Zenil","Felipe S. Abrahão","Luan C. S. M. 
Ozelim"],"pdf_url":"https://arxiv.org/pdf/2405.07803v3.pdf","comment":"arXiv:2303.16045 is based on this paper. arXiv admin note:\n substantial text overlap with arXiv:2303.16045"},{"id":"http://arxiv.org/abs/2304.07487v2","updated":"2024-12-20T03:51:41Z","published":"2023-04-15T06:35:28Z","title":"On User-side Fairness in Negative Sampling for Recommender Systems","summary":" Recommender systems are usually trained to discern between positive and\nnegative instances for each user. Negative sampling plays an important role in\nselecting informative negative items. Since positive data is disproportionately\ncontributed by a minority of active users, negative samplers might be affected\nby data imbalance thus choosing more informative negative items for active\nusers. Consequently, users with low participation are further underrepresented\nin the training data, potentially causing subpar treatment from recommenders.\nIn this paper we demonstrate empirically that active users receive more\naccurate recommendation than inactive users for state-of-the-art negative\nsampling strategies, and the degree of data imbalance influences the severity\nof performance disparities. We further show that the performance gain brought\nby sampling more negative instances for each positive item is unequally\ndistributed across user groups. Generally, active users benefit from\nperformance gain whereas inactive users might suffer from performance\ndegradation. To address these shortcomings, we propose a group-wise negative\nratio setup where we use the appropriate smaller negative ratio for inactive\nusers and a bigger ratio for active users. 
Comprehensive experiments show our\nproposed group-wise ratio outperforms a single global ratio in user-side\nfairness and performance improvement.\n","authors":["Yueqing Xuan","Kacper Sokol","Mark Sanderson","Jeffrey Chan"],"pdf_url":"https://arxiv.org/pdf/2304.07487v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.15510v1","updated":"2024-12-20T02:48:59Z","published":"2024-12-20T02:48:59Z","title":"ADEQA: A Question Answer based approach for joint ADE-Suspect Extraction\n using Sequence-To-Sequence Transformers","summary":" Early identification of Adverse Drug Events (ADE) is critical for taking\nprompt actions while introducing new drugs into the market. These ADEs\ninformation are available through various unstructured data sources like\nclinical study reports, patient health records, social media posts, etc.\nExtracting ADEs and the related suspect drugs using machine learning is a\nchallenging task due to the complex linguistic relations between drug ADE pairs\nin textual data and unavailability of large corpus of labelled datasets. This\npaper introduces ADEQA, a question-answer(QA) based approach using quasi\nsupervised labelled data and sequence-to-sequence transformers to extract ADEs,\ndrug suspects and the relationships between them. Unlike traditional QA models,\nnatural language generation (NLG) based models don't require extensive token\nlevel labelling and thereby reduces the adoption barrier significantly. 
On a\npublic ADE corpus, we were able to achieve state-of-the-art results with an F1\nscore of 94% on establishing the relationships between ADEs and the respective\nsuspects.\n","authors":["Vinayak Arannil","Tomal Deb","Atanu Roy"],"pdf_url":"https://arxiv.org/pdf/2412.15510v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.15494v1","updated":"2024-12-20T02:15:32Z","published":"2024-12-20T02:15:32Z","title":"PolySmart and VIREO @ TRECVid 2024 Ad-hoc Video Search","summary":" This year, we explore generation-augmented retrieval for the TRECVid AVS\ntask. Specifically, the understanding of textual query is enhanced by three\ngenerations, including Text2Text, Text2Image, and Image2Text, to address the\nout-of-vocabulary problem. Using different combinations of them and the rank\nlist retrieved by the original query, we submitted four automatic runs. For\nmanual runs, we use a large language model (LLM) (i.e., GPT4) to rephrase test\nqueries based on the concept bank of the search engine, and we manually check\nagain to ensure all the concepts used in the rephrased queries are in the bank.\nThe result shows that the fusion of the original and generated queries\noutperforms the original query on TV24 query sets. The generated queries\nretrieve different rank lists from the original query.\n","authors":["Jiaxin Wu","Chong-Wah Ngo","Xiao-Yong Wei","Qing Li"],"pdf_url":"https://arxiv.org/pdf/2412.15494v1.pdf","comment":null}],"Multimedia":[{"id":"http://arxiv.org/abs/2412.15602v1","updated":"2024-12-20T06:50:31Z","published":"2024-12-20T06:50:31Z","title":"Music Genre Classification: Ensemble Learning with Subcomponents-level\n Attention","summary":" Music Genre Classification is one of the most popular topics in the fields of\nMusic Information Retrieval (MIR) and digital signal processing. Deep Learning\nhas emerged as the top performer for classifying music genres among various\nmethods. 
The letter introduces a novel approach by combining ensemble learning\nwith attention to sub-components, aiming to enhance the accuracy of identifying\nmusic genres. The core innovation of our work is the proposal to classify the\nsubcomponents of the music pieces separately, allowing our model to capture\ndistinct characteristics from those sub components. By applying ensemble\nlearning techniques to these individual classifications, we make the final\nclassification decision on the genre of the music. The proposed method has\nsuperior advantages in terms of accuracy compared to the other state-of-the-art\ntechniques trained and tested on the GTZAN dataset.\n","authors":["Yichen Liu","Abhijit Dasgupta","Qiwei He"],"pdf_url":"https://arxiv.org/pdf/2412.15602v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.15514v1","updated":"2024-12-20T02:59:59Z","published":"2024-12-20T02:59:59Z","title":"PolySmart @ TRECVid 2024 Medical Video Question Answering","summary":" Video Corpus Visual Answer Localization (VCVAL) includes question-related\nvideo retrieval and visual answer localization in the videos. Specifically, we\nuse text-to-text retrieval to find relevant videos for a medical question based\non the similarity of video transcript and answers generated by GPT4. For the\nvisual answer localization, the start and end timestamps of the answer are\npredicted by the alignments on both visual content and subtitles with queries.\nFor the Query-Focused Instructional Step Captioning (QFISC) task, the step\ncaptions are generated by GPT4. 
Specifically, we provide the video captions\ngenerated by the LLaVA-Next-Video model and the video subtitles with timestamps\nas context, and ask GPT4 to generate step captions for the given medical query.\nWe only submit one run for evaluation and it obtains a F-score of 11.92 and\nmean IoU of 9.6527.\n","authors":["Jiaxin Wu","Yiyang Jiang","Xiao-Yong Wei","Qing Li"],"pdf_url":"https://arxiv.org/pdf/2412.15514v1.pdf","comment":null}]},"2024-12-19T00:00:00Z":{"Information Retrieval":[{"id":"http://arxiv.org/abs/2412.15404v1","updated":"2024-12-19T21:14:54Z","published":"2024-12-19T21:14:54Z","title":"A Retrieval-Augmented Generation Framework for Academic Literature\n Navigation in Data Science","summary":" In the rapidly evolving field of data science, efficiently navigating the\nexpansive body of academic literature is crucial for informed decision-making\nand innovation. This paper presents an enhanced Retrieval-Augmented Generation\n(RAG) application, an artificial intelligence (AI)-based system designed to\nassist data scientists in accessing precise and contextually relevant academic\nresources. The AI-powered application integrates advanced techniques, including\nthe GeneRation Of BIbliographic Data (GROBID) technique for extracting\nbibliographic information, fine-tuned embedding models, semantic chunking, and\nan abstract-first retrieval method, to significantly improve the relevance and\naccuracy of the retrieved information. This implementation of AI specifically\naddresses the challenge of academic literature navigation. A comprehensive\nevaluation using the Retrieval-Augmented Generation Assessment System (RAGAS)\nframework demonstrates substantial improvements in key metrics, particularly\nContext Relevance, underscoring the system's effectiveness in reducing\ninformation overload and enhancing decision-making processes. 
Our findings\nhighlight the potential of this enhanced Retrieval-Augmented Generation system\nto transform academic exploration within data science, ultimately advancing the\nworkflow of research and innovation in the field.\n","authors":["Ahmet Yasin Aytar","Kemal Kilic","Kamer Kaya"],"pdf_url":"https://arxiv.org/pdf/2412.15404v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.15396v1","updated":"2024-12-19T20:58:26Z","published":"2024-12-19T20:58:26Z","title":"Learning Visual Composition through Improved Semantic Guidance","summary":" Visual imagery does not consist of solitary objects, but instead reflects the\ncomposition of a multitude of fluid concepts. While there have been great\nadvances in visual representation learning, such advances have focused on\nbuilding better representations for a small number of discrete objects bereft\nof an understanding of how these objects are interacting. One can observe this\nlimitation in representations learned through captions or contrastive learning\n-- where the learned model treats an image essentially as a bag of words.\nSeveral works have attempted to address this limitation through the development\nof bespoke learned architectures to directly address the shortcomings in\ncompositional learning. In this work, we focus on simple, and scalable\napproaches. In particular, we demonstrate that by substantially improving\nweakly labeled data, i.e. captions, we can vastly improve the performance of\nstandard contrastive learning approaches. Previous CLIP models achieved near\nchance rate on challenging tasks probing compositional learning. However, our\nsimple approach boosts performance of CLIP substantially and surpasses all\nbespoke architectures. Furthermore, we showcase our results on a relatively new\ncaptioning benchmark derived from DOCCI. 
We demonstrate through a series of\nablations that a standard CLIP model trained with enhanced data may demonstrate\nimpressive performance on image retrieval tasks.\n","authors":["Austin Stone","Hagen Soltau","Robert Geirhos","Xi Yi","Ye Xia","Bingyi Cao","Kaifeng Chen","Abhijit Ogale","Jonathon Shlens"],"pdf_url":"https://arxiv.org/pdf/2412.15396v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2410.14567v2","updated":"2024-12-19T19:49:04Z","published":"2024-10-18T16:11:29Z","title":"ScopeQA: A Framework for Generating Out-of-Scope Questions for RAG","summary":" Conversational AI agents use Retrieval Augmented Generation (RAG) to provide\nverifiable document-grounded responses to user inquiries. However, many natural\nquestions do not have good answers: about 25\\% contain false\nassumptions~\\cite{Yu2023:CREPE}, and over 50\\% are\nambiguous~\\cite{DBLP:conf/emnlp/MinMHZ20}. RAG agents need high-quality data to\nimprove their responses to confusing questions. This paper presents a novel\nguided hallucination-based method to efficiently generate a diverse set of\nborderline out-of-scope confusing questions for a given document corpus. We\nconduct an empirical comparative evaluation of several large language models as\nRAG agents to measure the accuracy of confusion detection and appropriate\nresponse generation. 
We contribute a benchmark dataset to the public domain.\n","authors":["Zhiyuan Peng","Jinming Nian","Alexandre Evfimievski","Yi Fang"],"pdf_url":"https://arxiv.org/pdf/2410.14567v2.pdf","comment":"under review"},{"id":"http://arxiv.org/abs/2412.15093v1","updated":"2024-12-19T17:43:27Z","published":"2024-12-19T17:43:27Z","title":"Nano-ESG: Extracting Corporate Sustainability Information from News\n Articles","summary":" Determining the sustainability impact of companies is a highly complex\nsubject which has garnered more and more attention over the past few years.\nToday, investors largely rely on sustainability-ratings from established\nrating-providers in order to analyze how responsibly a company acts. However,\nthose ratings have recently been criticized for being hard to understand and\nnearly impossible to reproduce.\n An independent way to find out about the sustainability practices of\ncompanies lies in the rich landscape of news article data. In this paper, we\nexplore a different approach to identify key opportunities and challenges of\ncompanies in the sustainability domain. We present a novel dataset of more than\n840,000 news articles which were gathered for major German companies between\nJanuary 2023 and September 2024. By applying a mixture of Natural Language\nProcessing techniques, we first identify relevant articles, before summarizing\nthem and extracting their sustainability-related sentiment and aspect using\nLarge Language Models (LLMs). Furthermore, we conduct an evaluation of the\nobtained data and determine that the LLM-produced answers are accurate. We\nrelease both datasets at https://github.com/Bailefan/Nano-ESG.\n","authors":["Fabian Billert","Stefan Conrad"],"pdf_url":"https://arxiv.org/pdf/2412.15093v1.pdf","comment":"To be published at ECIR 2025. 
Preprint"},{"id":"http://arxiv.org/abs/2301.03767v2","updated":"2024-12-19T16:45:52Z","published":"2023-01-10T03:10:32Z","title":"Metric Compatible Training for Online Backfilling in Large-Scale\n Retrieval","summary":" Backfilling is the process of re-extracting all gallery embeddings from\nupgraded models in image retrieval systems. It inevitably requires a\nprohibitively large amount of computational cost and even entails the downtime\nof the service. Although backward-compatible learning sidesteps this challenge\nby tackling query-side representations, this leads to suboptimal solutions in\nprinciple because gallery embeddings cannot benefit from model upgrades. We\naddress this dilemma by introducing an online backfilling algorithm, which\nenables us to achieve a progressive performance improvement during the\nbackfilling process while not sacrificing the final performance of new model\nafter the completion of backfilling. To this end, we first propose a simple\ndistance rank merge technique for online backfilling. Then, we incorporate a\nreverse transformation module for more effective and efficient merging, which\nis further enhanced by adopting a metric-compatible contrastive learning\napproach. These two components help to make the distances of old and new models\ncompatible, resulting in desirable merge results during backfilling with no\nextra computational overhead. 
Extensive experiments show the effectiveness of\nour framework on four standard benchmarks in various settings.\n","authors":["Seonguk Seo","Mustafa Gokhan Uzunbas","Bohyung Han","Sara Cao","Ser-Nam Lim"],"pdf_url":"https://arxiv.org/pdf/2301.03767v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.14978v1","updated":"2024-12-19T15:53:21Z","published":"2024-12-19T15:53:21Z","title":"Spectrum-based Modality Representation Fusion Graph Convolutional\n Network for Multimodal Recommendation","summary":" Incorporating multi-modal features as side information has recently become a\ntrend in recommender systems. To elucidate user-item preferences, recent\nstudies focus on fusing modalities via concatenation, element-wise sum, or\nattention mechanisms. Despite having notable success, existing approaches do\nnot account for the modality-specific noise encapsulated within each modality.\nAs a result, direct fusion of modalities will lead to the amplification of\ncross-modality noise. Moreover, the variation of noise that is unique within\neach modality results in noise alleviation and fusion being more challenging.\nIn this work, we propose a new Spectrum-based Modality Representation (SMORE)\nfusion graph recommender that aims to capture both uni-modal and fusion\npreferences while simultaneously suppressing modality noise. Specifically,\nSMORE projects the multi-modal features into the frequency domain and leverages\nthe spectral space for fusion. To reduce dynamic contamination that is unique\nto each modality, we introduce a filter to attenuate and suppress the modality\nnoise adaptively while capturing the universal modality patterns effectively.\nFurthermore, we explore the item latent structures by designing a new\nmulti-modal graph learning module to capture associative semantic correlations\nand universal fusion patterns among similar items. 
Finally, we formulate a new\nmodality-aware preference module, which infuses behavioral features and\nbalances the uni- and multi-modal features for precise preference modeling.\nThis empowers SMORE with the ability to infer both user modality-specific and\nfusion preferences more accurately. Experiments on three real-world datasets\nshow the efficacy of our proposed model. The source code for this work has been\nmade publicly available at https://github.com/kennethorq/SMORE.\n","authors":["Rongqing Kenneth Ong","Andy W. H. Khong"],"pdf_url":"https://arxiv.org/pdf/2412.14978v1.pdf","comment":"Accepted to ACM Web Search and Data Mining (WSDM) 2025"},{"id":"http://arxiv.org/abs/2412.14967v1","updated":"2024-12-19T15:45:06Z","published":"2024-12-19T15:45:06Z","title":"ECLIPSE: Contrastive Dimension Importance Estimation with\n Pseudo-Irrelevance Feedback for Dense Retrieval","summary":" Recent advances in Information Retrieval have leveraged high-dimensional\nembedding spaces to improve the retrieval of relevant documents. Moreover, the\nManifold Clustering Hypothesis suggests that despite these high-dimensional\nrepresentations, documents relevant to a query reside on a lower-dimensional,\nquery-dependent manifold. While this hypothesis has inspired new retrieval\nmethods, existing approaches still face challenges in effectively separating\nnon-relevant information from relevant signals. We propose a novel methodology\nthat addresses these limitations by leveraging information from both relevant\nand non-relevant documents. Our method, ECLIPSE, computes a centroid based on\nirrelevant documents as a reference to estimate noisy dimensions present in\nrelevant ones, enhancing retrieval performance. Extensive experiments on three\nin-domain and one out-of-domain benchmarks demonstrate an average improvement\nof up to 19.50% (resp. 22.35%) in mAP(AP) and 11.42% (resp. 13.10%) in nDCG@10\nw.r.t. the DIME-based baseline (resp. the baseline using all dimensions). 
Our\nresults pave the way for more robust, pseudo-irrelevance-based retrieval\nsystems in future IR research.\n","authors":["Giulio D'Erasmo","Giovanni Trappolini","Nicola Tonellotto","Fabrizio Silvestri"],"pdf_url":"https://arxiv.org/pdf/2412.14967v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.00326v5","updated":"2024-12-19T15:07:38Z","published":"2023-12-01T03:44:54Z","title":"Agent-OM: Leveraging LLM Agents for Ontology Matching","summary":" Ontology matching (OM) enables semantic interoperability between different\nontologies and resolves their conceptual heterogeneity by aligning related\nentities. OM systems currently have two prevailing design paradigms:\nconventional knowledge-based expert systems and newer machine learning-based\npredictive systems. While large language models (LLMs) and LLM agents have\nrevolutionised data engineering and have been applied creatively in many\ndomains, their potential for OM remains underexplored. This study introduces a\nnovel agent-powered LLM-based design paradigm for OM systems. With\nconsideration of several specific challenges in leveraging LLM agents for OM,\nwe propose a generic framework, namely Agent-OM (Agent for Ontology Matching),\nconsisting of two Siamese agents for retrieval and matching, with a set of\nsimple OM tools. 
Our framework is implemented in a proof-of-concept system.\nEvaluations of three Ontology Alignment Evaluation Initiative (OAEI) tracks\nover state-of-the-art OM systems show that our system can achieve results very\nclose to the long-standing best performance on simple OM tasks and can\nsignificantly improve the performance on complex and few-shot OM tasks.\n","authors":["Zhangcheng Qiang","Weiqing Wang","Kerry Taylor"],"pdf_url":"https://arxiv.org/pdf/2312.00326v5.pdf","comment":"19 pages, 13 figures, 4 tables"},{"id":"http://arxiv.org/abs/2412.15310v1","updated":"2024-12-19T15:02:33Z","published":"2024-12-19T15:02:33Z","title":"MRWeb: An Exploration of Generating Multi-Page Resource-Aware Web Code\n from UI Designs","summary":" Multi-page websites dominate modern web development. However, existing\ndesign-to-code methods rely on simplified assumptions, limiting to single-page,\nself-contained webpages without external resource connection. To address this\ngap, we introduce the Multi-Page Resource-Aware Webpage (MRWeb) generation\ntask, which transforms UI designs into multi-page, functional web UIs with\ninternal/external navigation, image loading, and backend routing. We propose a\nnovel resource list data structure to track resources, links, and design\ncomponents. Our study applies existing methods to the MRWeb problem using a\nnewly curated dataset of 500 websites (300 synthetic, 200 real-world).\nSpecifically, we identify the best metric to evaluate the similarity of the web\nUI, assess the impact of the resource list on MRWeb generation, analyze MLLM\nlimitations, and evaluate the effectiveness of the MRWeb tool in real-world\nworkflows. The results show that resource lists boost navigation functionality\nfrom 0% to 66%-80% while facilitating visual similarity. Our proposed metrics\nand evaluation framework provide new insights into MLLM performance on MRWeb\ntasks. 
We release the MRWeb tool, dataset, and evaluation framework to promote\nfurther research.\n","authors":["Yuxuan Wan","Yi Dong","Jingyu Xiao","Yintong Huo","Wenxuan Wang","Michael R. Lyu"],"pdf_url":"https://arxiv.org/pdf/2412.15310v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2402.00390v2","updated":"2024-12-19T14:28:19Z","published":"2024-02-01T07:22:52Z","title":"DNS-Rec: Data-aware Neural Architecture Search for Recommender Systems","summary":" In the era of data proliferation, efficiently sifting through vast\ninformation to extract meaningful insights has become increasingly crucial.\nThis paper addresses the computational overhead and resource inefficiency\nprevalent in existing Sequential Recommender Systems (SRSs). We introduce an\ninnovative approach combining pruning methods with advanced model designs.\nFurthermore, we delve into resource-constrained Neural Architecture Search\n(NAS), an emerging technique in recommender systems, to optimize models in\nterms of FLOPs, latency, and energy consumption while maintaining or enhancing\naccuracy. Our principal contribution is the development of a Data-aware Neural\nArchitecture Search for Recommender System (DNS-Rec). DNS-Rec is specifically\ndesigned to tailor compact network architectures for attention-based SRS\nmodels, thereby ensuring accuracy retention. It incorporates data-aware gates\nto enhance the performance of the recommendation network by learning\ninformation from historical user-item interactions. Moreover, DNS-Rec employs a\ndynamic resource constraint strategy, stabilizing the search process and\nyielding more suitable architectural solutions. We demonstrate the\neffectiveness of our approach through rigorous experiments conducted on three\nbenchmark datasets, which highlight the superiority of DNS-Rec in SRSs. 
Our\nfindings set a new standard for future research in efficient and accurate\nrecommendation systems, marking a significant step forward in this rapidly\nevolving field.\n","authors":["Sheng Zhang","Maolin Wang","Yao Zhao","Chenyi Zhuang","Jinjie Gu","Ruocheng Guo","Xiangyu Zhao","Zijian Zhang","Hongzhi Yin"],"pdf_url":"https://arxiv.org/pdf/2402.00390v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.15308v1","updated":"2024-12-19T13:41:59Z","published":"2024-12-19T13:41:59Z","title":"ViFactCheck: A New Benchmark Dataset and Methods for Multi-domain News\n Fact-Checking in Vietnamese","summary":" The rapid spread of information in the digital age highlights the critical\nneed for effective fact-checking tools, particularly for languages with limited\nresources, such as Vietnamese. In response to this challenge, we introduce\nViFactCheck, the first publicly available benchmark dataset designed\nspecifically for Vietnamese fact-checking across multiple online news domains.\nThis dataset contains 7,232 human-annotated pairs of claim-evidence\ncombinations sourced from reputable Vietnamese online news, covering 12 diverse\ntopics. It has been subjected to a meticulous annotation process to ensure high\nquality and reliability, achieving a Fleiss Kappa inter-annotator agreement\nscore of 0.83. Our evaluation leverages state-of-the-art pre-trained and large\nlanguage models, employing fine-tuning and prompting techniques to assess\nperformance. Notably, the Gemma model demonstrated superior effectiveness, with\nan impressive macro F1 score of 89.90%, thereby establishing a new standard for\nfact-checking benchmarks. This result highlights the robust capabilities of\nGemma in accurately identifying and verifying facts in Vietnamese. To further\npromote advances in fact-checking technology and improve the reliability of\ndigital media, we have made the ViFactCheck dataset, model checkpoints,\nfact-checking pipelines, and source code freely available on GitHub. 
This\ninitiative aims to inspire further research and enhance the accuracy of\ninformation in low-resource languages.\n","authors":["Tran Thai Hoa","Tran Quang Duy","Khanh Quoc Tran","Kiet Van Nguyen"],"pdf_url":"https://arxiv.org/pdf/2412.15308v1.pdf","comment":"Accepted at AAAI'2025 Main Conference"},{"id":"http://arxiv.org/abs/2412.14835v1","updated":"2024-12-19T13:25:39Z","published":"2024-12-19T13:25:39Z","title":"Progressive Multimodal Reasoning via Active Retrieval","summary":" Multi-step multimodal reasoning tasks pose significant challenges for\nmultimodal large language models (MLLMs), and finding effective ways to enhance\ntheir performance in such scenarios remains an unresolved issue. In this paper,\nwe propose AR-MCTS, a universal framework designed to progressively improve the\nreasoning capabilities of MLLMs through Active Retrieval (AR) and Monte Carlo\nTree Search (MCTS). Our approach begins with the development of a unified\nretrieval module that retrieves key supporting insights for solving complex\nreasoning problems from a hybrid-modal retrieval corpus. To bridge the gap in\nautomated multimodal reasoning verification, we employ the MCTS algorithm\ncombined with an active retrieval mechanism, which enables the automatic\ngeneration of step-wise annotations. This strategy dynamically retrieves key\ninsights for each reasoning step, moving beyond traditional beam search\nsampling to improve the diversity and reliability of the reasoning space.\nAdditionally, we introduce a process reward model that aligns progressively to\nsupport the automatic verification of multimodal reasoning tasks. Experimental\nresults across three complex multimodal reasoning benchmarks confirm the\neffectiveness of the AR-MCTS framework in enhancing the performance of various\nmultimodal models. 
Further analysis demonstrates that AR-MCTS can optimize\nsampling diversity and accuracy, yielding reliable multimodal reasoning.\n","authors":["Guanting Dong","Chenghao Zhang","Mengjie Deng","Yutao Zhu","Zhicheng Dou","Ji-Rong Wen"],"pdf_url":"https://arxiv.org/pdf/2412.14835v1.pdf","comment":"Working in progress"},{"id":"http://arxiv.org/abs/2406.05666v9","updated":"2024-12-19T12:13:26Z","published":"2024-06-09T06:49:22Z","title":"Probability Distribution Learning and Its Application in Deep Learning","summary":" This paper introduces a novel theoretical learning framework, termed\nprobability distribution learning (PD learning). Departing from the traditional\nstatistical learning framework, PD learning focuses on learning the underlying\nprobability distribution, which is modeled as a random variable within the\nprobability simplex. In this framework, the optimization objective is the\nlearning error, which quantifies the posterior expected discrepancy between the\nmodel's predicted distribution and the underlying true distribution, given\navailable sample data and prior knowledge. To optimize the learning error, this\npaper proposes the necessary conditions for loss functions, models, and\noptimization algorithms, ensuring that these conditions are met in real-world\nmachine learning scenarios. Based on these conditions, the non-convex\noptimization mechanism corresponding to model training can be theoretically\nresolved. Moreover, this paper provides model-dependent and model-independent\nbounds on learning error, offering new insights into the model's fitting and\ngeneralization capabilities. Furthermore, the paper applies the PD learning\nframework to elucidate the mechanisms by which various techniques, including\nrandom parameter initialization, over-parameterization, and dropout, influence\ndeep model training. 
Finally, the paper substantiates the key conclusions of\nthe proposed framework through experimental results.\n","authors":["Binchuan Qi"],"pdf_url":"https://arxiv.org/pdf/2406.05666v9.pdf","comment":"arXiv admin note: text overlap with arXiv:2105.04026 by other\n authors. arXiv admin note: text overlap with arXiv:2105.04026 by other\n authors"},{"id":"http://arxiv.org/abs/2411.04677v3","updated":"2024-12-19T12:08:31Z","published":"2024-11-07T13:03:21Z","title":"Lightning IR: Straightforward Fine-tuning and Inference of\n Transformer-based Language Models for Information Retrieval","summary":" A wide range of transformer-based language models have been proposed for\ninformation retrieval tasks. However, including transformer-based models in\nretrieval pipelines is often complex and requires substantial engineering\neffort. In this paper, we introduce Lightning IR, an easy-to-use PyTorch\nLightning-based framework for applying transformer-based language models in\nretrieval scenarios. Lightning IR provides a modular and extensible\narchitecture that supports all stages of a retrieval pipeline: from fine-tuning\nand indexing to searching and re-ranking. Designed to be scalable and\nreproducible, Lightning IR is available as open-source:\nhttps://github.com/webis-de/lightning-ir.\n","authors":["Ferdinand Schlatt","Maik Fröbe","Matthias Hagen"],"pdf_url":"https://arxiv.org/pdf/2411.04677v3.pdf","comment":"Accepted as a demo at WSDM'25"},{"id":"http://arxiv.org/abs/2412.11216v2","updated":"2024-12-19T08:32:20Z","published":"2024-12-15T15:13:14Z","title":"Distribution-Consistency-Guided Multi-modal Hashing","summary":" Multi-modal hashing methods have gained popularity due to their fast speed\nand low storage requirements. Among them, the supervised methods demonstrate\nbetter performance by utilizing labels as supervisory signals compared with\nunsupervised methods. 
Currently, for almost all supervised multi-modal hashing\nmethods, there is a hidden assumption that training sets have no noisy labels.\nHowever, labels are often annotated incorrectly due to manual labeling in\nreal-world scenarios, which will greatly harm the retrieval performance. To\naddress this issue, we first discover a significant distribution consistency\npattern through experiments, i.e., the 1-0 distribution of the presence or\nabsence of each category in the label is consistent with the high-low\ndistribution of similarity scores of the hash codes relative to category\ncenters. Then, inspired by this pattern, we propose a novel\nDistribution-Consistency-Guided Multi-modal Hashing (DCGMH), which aims to\nfilter and reconstruct noisy labels to enhance retrieval performance.\nSpecifically, the proposed method first randomly initializes several category\ncenters, which are used to compute the high-low distribution of similarity\nscores; Noisy and clean labels are then separately filtered out via the\ndiscovered distribution consistency pattern to mitigate the impact of noisy\nlabels; Subsequently, a correction strategy, which is indirectly designed via\nthe distribution consistency pattern, is applied to the filtered noisy labels,\ncorrecting high-confidence ones while treating low-confidence ones as unlabeled\nfor unsupervised learning, thereby further enhancing the model's performance.\nExtensive experiments on three widely used datasets demonstrate the superiority\nof the proposed method compared to state-of-the-art baselines in multi-modal\nretrieval tasks. 
The code is available at\nhttps://github.com/LiuJinyu1229/DCGMH.\n","authors":["Jin-Yu Liu","Xian-Ling Mao","Tian-Yi Che","Rong-Cheng Tu"],"pdf_url":"https://arxiv.org/pdf/2412.11216v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.12470v2","updated":"2024-12-19T08:26:32Z","published":"2024-08-22T15:10:56Z","title":"DLCRec: A Novel Approach for Managing Diversity in LLM-Based Recommender\n Systems","summary":" The integration of Large Language Models (LLMs) into recommender systems has\nled to substantial performance improvements. However, this often comes at the\ncost of diminished recommendation diversity, which can negatively impact user\nsatisfaction. To address this issue, controllable recommendation has emerged as\na promising approach, allowing users to specify their preferences and receive\nrecommendations that meet their diverse needs. Despite its potential, existing\ncontrollable recommender systems frequently rely on simplistic mechanisms, such\nas a single prompt, to regulate diversity-an approach that falls short of\ncapturing the full complexity of user preferences. In response to these\nlimitations, we propose DLCRec, a novel framework designed to enable\nfine-grained control over diversity in LLM-based recommendations. Unlike\ntraditional methods, DLCRec adopts a fine-grained task decomposition strategy,\nbreaking down the recommendation process into three sequential sub-tasks: genre\nprediction, genre filling, and item prediction. These sub-tasks are trained\nindependently and inferred sequentially according to user-defined control\nnumbers, ensuring more precise control over diversity. Furthermore, the\nscarcity and uneven distribution of diversity-related user behavior data pose\nsignificant challenges for fine-tuning. To overcome these obstacles, we\nintroduce two data augmentation techniques that enhance the model's robustness\nto noisy and out-of-distribution data. 
These techniques expose the model to a\nbroader range of patterns, improving its adaptability in generating\nrecommendations with varying levels of diversity. Our extensive empirical\nevaluation demonstrates that DLCRec not only provides precise control over\ndiversity but also outperforms state-of-the-art baselines across multiple\nrecommendation scenarios.\n","authors":["Jiaju Chen","Chongming Gao","Shuai Yuan","Shuchang Liu","Qingpeng Cai","Peng Jiang"],"pdf_url":"https://arxiv.org/pdf/2408.12470v2.pdf","comment":"Accepted by WSDM 2025"},{"id":"http://arxiv.org/abs/2412.14574v1","updated":"2024-12-19T06:44:59Z","published":"2024-12-19T06:44:59Z","title":"Sliding Windows Are Not the End: Exploring Full Ranking with\n Long-Context Large Language Models","summary":" Large Language Models (LLMs) have shown exciting performance in listwise\npassage ranking. Due to the limited input length, existing methods often adopt\nthe sliding window strategy. Such a strategy, though effective, is inefficient\nas it involves repetitive and serialized processing, which usually re-evaluates\nrelevant passages multiple times. As a result, it incurs redundant API costs,\nwhich are proportional to the number of inference tokens. The development of\nlong-context LLMs enables the full ranking of all passages within a single\ninference, avoiding redundant API costs. In this paper, we conduct a\ncomprehensive study of long-context LLMs for ranking tasks in terms of\nefficiency and effectiveness. Surprisingly, our experiments reveal that full\nranking with long-context LLMs can deliver superior performance in the\nsupervised fine-tuning setting with a huge efficiency improvement. Furthermore,\nwe identify two limitations of fine-tuning the full ranking model based on\nexisting methods: (1) sliding window strategy fails to produce a full ranking\nlist as a training label, and (2) the language modeling loss cannot emphasize\ntop-ranked passage IDs in the label. 
To alleviate these issues, we propose a\nnew complete listwise label construction approach and a novel importance-aware\nlearning objective for full ranking. Experiments show the superior performance\nof our method over baselines. Our codes are available at\n\\url{https://github.com/8421BCD/fullrank}.\n","authors":["Wenhan Liu","Xinyu Ma","Yutao Zhu","Ziliang Zhao","Shuaiqiang Wang","Dawei Yin","Zhicheng Dou"],"pdf_url":"https://arxiv.org/pdf/2412.14574v1.pdf","comment":"14 pages"},{"id":"http://arxiv.org/abs/2405.00287v2","updated":"2024-12-19T05:48:08Z","published":"2024-05-01T02:27:59Z","title":"SCONE: A Novel Stochastic Sampling to Generate Contrastive Views and\n Hard Negative Samples for Recommendation","summary":" Graph-based collaborative filtering (CF) has emerged as a promising approach\nin recommender systems. Despite its achievements, graph-based CF models face\nchallenges due to data sparsity and negative sampling. In this paper, we\npropose a novel Stochastic sampling for i) COntrastive views and ii) hard\nNEgative samples (SCONE) to overcome these issues. SCONE generates dynamic\naugmented views and diverse hard negative samples via a unified stochastic\nsampling approach based on score-based generative models. Our extensive\nexperiments on 6 benchmark datasets show that SCONE consistently outperforms\nstate-of-the-art baselines. SCONE shows efficacy in addressing user sparsity\nand item popularity issues, while enhancing performance for both cold-start\nusers and long-tail items. Furthermore, our approach improves the diversity of\nthe recommendation and the uniformity of the representations. The code is\navailable at https://github.com/jeongwhanchoi/SCONE.\n","authors":["Chaejeong Lee","Jeongwhan Choi","Hyowon Wi","Sung-Bae Cho","Noseong Park"],"pdf_url":"https://arxiv.org/pdf/2405.00287v2.pdf","comment":"Accepted to WSDM 2025. 
Chaejeong Lee and Jeongwhan Choi are co-first\n authors with equal contributions"},{"id":"http://arxiv.org/abs/2412.14518v1","updated":"2024-12-19T04:33:22Z","published":"2024-12-19T04:33:22Z","title":"Efficient Self-Supervised Video Hashing with Selective State Spaces","summary":" Self-supervised video hashing (SSVH) is a practical task in video indexing\nand retrieval. Although Transformers are predominant in SSVH for their\nimpressive temporal modeling capabilities, they often suffer from computational\nand memory inefficiencies. Drawing inspiration from Mamba, an advanced\nstate-space model, we explore its potential in SSVH to achieve a better balance\nbetween efficacy and efficiency. We introduce S5VH, a Mamba-based video hashing\nmodel with an improved self-supervised learning paradigm. Specifically, we\ndesign bidirectional Mamba layers for both the encoder and decoder, which are\neffective and efficient in capturing temporal relationships thanks to the\ndata-dependent selective scanning mechanism with linear complexity. In our\nlearning strategy, we transform global semantics in the feature space into\nsemantically consistent and discriminative hash centers, followed by a center\nalignment loss as a global learning signal. Our self-local-global (SLG)\nparadigm significantly improves learning efficiency, leading to faster and\nbetter convergence. Extensive experiments demonstrate S5VH's improvements over\nstate-of-the-art methods, superior transferability, and scalable advantages in\ninference efficiency. Code is available at\nhttps://github.com/gimpong/AAAI25-S5VH.\n","authors":["Jinpeng Wang","Niu Lian","Jun Li","Yuting Wang","Yan Feng","Bin Chen","Yongbing Zhang","Shu-Tao Xia"],"pdf_url":"https://arxiv.org/pdf/2412.14518v1.pdf","comment":"Accepted by AAAI'25. 
9 pages, 5 figures, 2 tables"},{"id":"http://arxiv.org/abs/2412.14486v1","updated":"2024-12-19T03:19:18Z","published":"2024-12-19T03:19:18Z","title":"Moving Beyond LDA: A Comparison of Unsupervised Topic Modelling\n Techniques for Qualitative Data Analysis of Online Communities","summary":" Social media constitutes a rich and influential source of information for\nqualitative researchers. Although computational techniques like topic modelling\nassist with managing the volume and diversity of social media content,\nqualitative researcher's lack of programming expertise creates a significant\nbarrier to their adoption. In this paper we explore how BERTopic, an advanced\nLarge Language Model (LLM)-based topic modelling technique, can support\nqualitative data analysis of social media. We conducted interviews and hands-on\nevaluations in which qualitative researchers compared topics from three\nmodelling techniques: LDA, NMF, and BERTopic. BERTopic was favoured by 8 of 12\nparticipants for its ability to provide detailed, coherent clusters for deeper\nunderstanding and actionable insights. Participants also prioritised topic\nrelevance, logical organisation, and the capacity to reveal unexpected\nrelationships within the data. Our findings underscore the potential of\nLLM-based techniques for supporting qualitative analysis.\n","authors":["Amandeep Kaur","James R. Wallace"],"pdf_url":"https://arxiv.org/pdf/2412.14486v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.14476v1","updated":"2024-12-19T02:57:02Z","published":"2024-12-19T02:57:02Z","title":"HEC-GCN: Hypergraph Enhanced Cascading Graph Convolution Network for\n Multi-Behavior Recommendation","summary":" Multi-behavior recommendation (MBR) has garnered growing attention recently\ndue to its ability to mitigate the sparsity issue by inferring user preferences\nfrom various auxiliary behaviors to improve predictions for the target\nbehavior. 
Although existing research on MBR has yielded impressive results,\nthey still face two major limitations. First, previous methods mainly focus on\nmodeling fine-grained interaction information between users and items under\neach behavior, which may suffer from sparsity issue. Second, existing models\nusually concentrate on exploiting dependencies between two consecutive\nbehaviors, leaving intra- and inter-behavior consistency largely unexplored. To\nthe end, we propose a novel approach named Hypergraph Enhanced Cascading Graph\nConvolution Network for multi-behavior recommendation (HEC-GCN). To be\nspecific, we first explore both fine- and coarse-grained correlations among\nusers or items of each behavior by simultaneously modeling the\nbehavior-specific interaction graph and its corresponding hypergraph in a\ncascaded manner. Then, we propose a behavior consistency-guided alignment\nstrategy that ensures consistent representations between the interaction graph\nand its associated hypergraph for each behavior, while also maintaining\nrepresentation consistency across different behaviors. Extensive experiments\nand analyses on three public benchmark datasets demonstrate that our proposed\napproach is consistently superior to previous state-of-the-art methods due to\nits capability to effectively attenuate the sparsity issue as well as preserve\nboth intra- and inter-behavior consistencies. 
The code is available at\nhttps://github.com/marqu22/HEC-GCN.git.\n","authors":["Yabo Yin","Xiaofei Zhu","Wenshan Wang","Yihao Zhang","Pengfei Wang","Yixing Fan","Jiafeng Guo"],"pdf_url":"https://arxiv.org/pdf/2412.14476v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.14768v3","updated":"2024-12-19T02:18:54Z","published":"2024-05-23T16:35:52Z","title":"WISE: Rethinking the Knowledge Memory for Lifelong Model Editing of\n Large Language Models","summary":" Large language models (LLMs) need knowledge updates to meet the ever-growing\nworld facts and correct the hallucinated responses, facilitating the methods of\nlifelong model editing. Where the updated knowledge resides in memories is a\nfundamental question for model editing. In this paper, we find that editing\neither long-term memory (direct model parameters) or working memory\n(non-parametric knowledge of neural network activations/representations by\nretrieval) will result in an impossible triangle -- reliability,\ngeneralization, and locality can not be realized together in the lifelong\nediting settings. For long-term memory, directly editing the parameters will\ncause conflicts with irrelevant pretrained knowledge or previous edits (poor\nreliability and locality). For working memory, retrieval-based activations can\nhardly make the model understand the edits and generalize (poor\ngeneralization). Therefore, we propose WISE to bridge the gap between memories.\nIn WISE, we design a dual parametric memory scheme, which consists of the main\nmemory for the pretrained knowledge and a side memory for the edited knowledge.\nWe only edit the knowledge in the side memory and train a router to decide\nwhich memory to go through when given a query. For continual editing, we devise\na knowledge-sharding mechanism where different sets of edits reside in distinct\nsubspaces of parameters, and are subsequently merged into a shared memory\nwithout conflicts. 
Extensive experiments show that WISE can outperform previous\nmodel editing methods and overcome the impossible triangle under lifelong model\nediting of question answering, hallucination, and out-of-distribution settings\nacross trending LLM architectures, e.g., GPT, LLaMA, and Mistral. Code is\navailable at https://github.com/zjunlp/EasyEdit.\n","authors":["Peng Wang","Zexi Li","Ningyu Zhang","Ziwen Xu","Yunzhi Yao","Yong Jiang","Pengjun Xie","Fei Huang","Huajun Chen"],"pdf_url":"https://arxiv.org/pdf/2405.14768v3.pdf","comment":"NeurIPS 2024"},{"id":"http://arxiv.org/abs/2412.14457v1","updated":"2024-12-19T02:17:35Z","published":"2024-12-19T02:17:35Z","title":"VISA: Retrieval Augmented Generation with Visual Source Attribution","summary":" Generation with source attribution is important for enhancing the\nverifiability of retrieval-augmented generation (RAG) systems. However,\nexisting approaches in RAG primarily link generated content to document-level\nreferences, making it challenging for users to locate evidence among multiple\ncontent-rich retrieved documents. To address this challenge, we propose\nRetrieval-Augmented Generation with Visual Source Attribution (VISA), a novel\napproach that combines answer generation with visual source attribution.\nLeveraging large vision-language models (VLMs), VISA identifies the evidence\nand highlights the exact regions that support the generated answers with\nbounding boxes in the retrieved document screenshots. To evaluate its\neffectiveness, we curated two datasets: Wiki-VISA, based on crawled Wikipedia\nwebpage screenshots, and Paper-VISA, derived from PubLayNet and tailored to the\nmedical domain. Experimental results demonstrate the effectiveness of VISA for\nvisual source attribution on documents' original look, as well as highlighting\nthe challenges for improvement. 
Code, data, and model checkpoints will be\nreleased.\n","authors":["Xueguang Ma","Shengyao Zhuang","Bevan Koopman","Guido Zuccon","Wenhu Chen","Jimmy Lin"],"pdf_url":"https://arxiv.org/pdf/2412.14457v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.17969v3","updated":"2024-12-19T02:10:00Z","published":"2024-05-28T08:56:33Z","title":"Knowledge Circuits in Pretrained Transformers","summary":" The remarkable capabilities of modern large language models are rooted in\ntheir vast repositories of knowledge encoded within their parameters, enabling\nthem to perceive the world and engage in reasoning. The inner workings of how\nthese models store knowledge have long been a subject of intense interest and\ninvestigation among researchers. To date, most studies have concentrated on\nisolated components within these models, such as the Multilayer Perceptrons and\nattention head. In this paper, we delve into the computation graph of the\nlanguage model to uncover the knowledge circuits that are instrumental in\narticulating specific knowledge. The experiments, conducted with GPT2 and\nTinyLLAMA, have allowed us to observe how certain information heads, relation\nheads, and Multilayer Perceptrons collaboratively encode knowledge within the\nmodel. Moreover, we evaluate the impact of current knowledge editing techniques\non these knowledge circuits, providing deeper insights into the functioning and\nconstraints of these editing methodologies. Finally, we utilize knowledge\ncircuits to analyze and interpret language model behaviors such as\nhallucinations and in-context learning. We believe the knowledge circuits hold\npotential for advancing our understanding of Transformers and guiding the\nimproved design of knowledge editing. 
Code and data are available in\nhttps://github.com/zjunlp/KnowledgeCircuits.\n","authors":["Yunzhi Yao","Ningyu Zhang","Zekun Xi","Mengru Wang","Ziwen Xu","Shumin Deng","Huajun Chen"],"pdf_url":"https://arxiv.org/pdf/2405.17969v3.pdf","comment":"NeurIPS 2024, 26 pages"},{"id":"http://arxiv.org/abs/2412.14454v1","updated":"2024-12-19T02:09:59Z","published":"2024-12-19T02:09:59Z","title":"Are Longer Prompts Always Better? Prompt Selection in Large Language\n Models for Recommendation Systems","summary":" In large language models (LLM)-based recommendation systems (LLM-RSs),\naccurately predicting user preferences by leveraging the general knowledge of\nLLMs is possible without requiring extensive training data. By converting\nrecommendation tasks into natural language inputs called prompts, LLM-RSs can\nefficiently solve issues that have been difficult to address due to data\nscarcity but are crucial in applications such as cold-start and cross-domain\nproblems. However, when applying this in practice, selecting the prompt that\nmatches tasks and data is essential. Although numerous prompts have been\nproposed in LLM-RSs and representing the target user in prompts significantly\nimpacts recommendation accuracy, there are still no clear guidelines for\nselecting specific prompts.\n In this paper, we categorize and analyze prompts from previous research to\nestablish practical prompt selection guidelines. Through 450 experiments with\n90 prompts and five real-world datasets, we examined the relationship between\nprompts and dataset characteristics in recommendation accuracy. We found that\nno single prompt consistently outperforms others; thus, selecting prompts on\nthe basis of dataset characteristics is crucial. 
Here, we propose a prompt\nselection method that achieves higher accuracy with minimal validation data.\nBecause increasing the number of prompts to explore raises costs, we also\nintroduce a cost-efficient strategy using high-performance and cost-efficient\nLLMs, significantly reducing exploration costs while maintaining high\nprediction accuracy. Our work offers valuable insights into the prompt\nselection, advancing accurate and efficient LLM-RSs.\n","authors":["Genki Kusano","Kosuke Akimoto","Kunihiro Takeoka"],"pdf_url":"https://arxiv.org/pdf/2412.14454v1.pdf","comment":"15 pages"}],"Multimedia":[{"id":"http://arxiv.org/abs/2412.15156v1","updated":"2024-12-19T18:32:21Z","published":"2024-12-19T18:32:21Z","title":"Prompt-A-Video: Prompt Your Video Diffusion Model via Preference-Aligned\n LLM","summary":" Text-to-video models have made remarkable advancements through optimization\non high-quality text-video pairs, where the textual prompts play a pivotal role\nin determining quality of output videos. However, achieving the desired output\noften entails multiple revisions and iterative inference to refine\nuser-provided prompts. Current automatic methods for refining prompts encounter\nchallenges such as Modality-Inconsistency, Cost-Discrepancy, and Model-Unaware\nwhen applied to text-to-video diffusion models. To address these problem, we\nintroduce an LLM-based prompt adaptation framework, termed as Prompt-A-Video,\nwhich excels in crafting Video-Centric, Labor-Free and Preference-Aligned\nprompts tailored to specific video diffusion model. Our approach involves a\nmeticulously crafted two-stage optimization and alignment system. Initially, we\nconduct a reward-guided prompt evolution pipeline to automatically create\noptimal prompts pool and leverage them for supervised fine-tuning (SFT) of the\nLLM. 
Then multi-dimensional rewards are employed to generate pairwise data for\nthe SFT model, followed by the direct preference optimization (DPO) algorithm\nto further facilitate preference alignment. Through extensive experimentation\nand comparative analyses, we validate the effectiveness of Prompt-A-Video\nacross diverse generation models, highlighting its potential to push the\nboundaries of video generation.\n","authors":["Yatai Ji","Jiacheng Zhang","Jie Wu","Shilong Zhang","Shoufa Chen","Chongjian GE","Peize Sun","Weifeng Chen","Wenqi Shao","Xuefeng Xiao","Weilin Huang","Ping Luo"],"pdf_url":"https://arxiv.org/pdf/2412.15156v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.15023v1","updated":"2024-12-19T16:37:19Z","published":"2024-12-19T16:37:19Z","title":"Stable-V2A: Synthesis of Synchronized Sound Effects with Temporal and\n Semantic Controls","summary":" Sound designers and Foley artists usually sonorize a scene, such as from a\nmovie or video game, by manually annotating and sonorizing each action of\ninterest in the video. In our case, the intent is to leave full creative\ncontrol to sound designers with a tool that allows them to bypass the more\nrepetitive parts of their work, thus being able to focus on the creative\naspects of sound production. We achieve this presenting Stable-V2A, a two-stage\nmodel consisting of: an RMS-Mapper that estimates an envelope representative of\nthe audio characteristics associated with the input video; and Stable-Foley, a\ndiffusion model based on Stable Audio Open that generates audio semantically\nand temporally aligned with the target video. Temporal alignment is guaranteed\nby the use of the envelope as a ControlNet input, while semantic alignment is\nachieved through the use of sound representations chosen by the designer as\ncross-attention conditioning of the diffusion process. We train and test our\nmodel on Greatest Hits, a dataset commonly used to evaluate V2A models. 
In\naddition, to test our model on a case study of interest, we introduce Walking\nThe Maps, a dataset of videos extracted from video games depicting animated\ncharacters walking in different locations. Samples and code available on our\ndemo page at https://ispamm.github.io/Stable-V2A.\n","authors":["Riccardo Fosco Gramaccioni","Christian Marinoni","Emilian Postolache","Marco Comunità","Luca Cosmo","Joshua D. Reiss","Danilo Comminiello"],"pdf_url":"https://arxiv.org/pdf/2412.15023v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.14978v1","updated":"2024-12-19T15:53:21Z","published":"2024-12-19T15:53:21Z","title":"Spectrum-based Modality Representation Fusion Graph Convolutional\n Network for Multimodal Recommendation","summary":" Incorporating multi-modal features as side information has recently become a\ntrend in recommender systems. To elucidate user-item preferences, recent\nstudies focus on fusing modalities via concatenation, element-wise sum, or\nattention mechanisms. Despite having notable success, existing approaches do\nnot account for the modality-specific noise encapsulated within each modality.\nAs a result, direct fusion of modalities will lead to the amplification of\ncross-modality noise. Moreover, the variation of noise that is unique within\neach modality results in noise alleviation and fusion being more challenging.\nIn this work, we propose a new Spectrum-based Modality Representation (SMORE)\nfusion graph recommender that aims to capture both uni-modal and fusion\npreferences while simultaneously suppressing modality noise. Specifically,\nSMORE projects the multi-modal features into the frequency domain and leverages\nthe spectral space for fusion. 
To reduce dynamic contamination that is unique\nto each modality, we introduce a filter to attenuate and suppress the modality\nnoise adaptively while capturing the universal modality patterns effectively.\nFurthermore, we explore the item latent structures by designing a new\nmulti-modal graph learning module to capture associative semantic correlations\nand universal fusion patterns among similar items. Finally, we formulate a new\nmodality-aware preference module, which infuses behavioral features and\nbalances the uni- and multi-modal features for precise preference modeling.\nThis empowers SMORE with the ability to infer both user modality-specific and\nfusion preferences more accurately. Experiments on three real-world datasets\nshow the efficacy of our proposed model. The source code for this work has been\nmade publicly available at https://github.com/kennethorq/SMORE.\n","authors":["Rongqing Kenneth Ong","Andy W. H. Khong"],"pdf_url":"https://arxiv.org/pdf/2412.14978v1.pdf","comment":"Accepted to ACM Web Search and Data Mining (WSDM) 2025"},{"id":"http://arxiv.org/abs/2310.14778v3","updated":"2024-12-19T11:49:06Z","published":"2023-10-23T10:29:33Z","title":"Audio-Visual Speaker Tracking: Progress, Challenges, and Future\n Directions","summary":" Audio-visual speaker tracking has drawn increasing attention over the past\nfew years due to its academic values and wide application. Audio and visual\nmodalities can provide complementary information for localization and tracking.\nWith audio and visual information, the Bayesian-based filter can solve the\nproblem of data association, audio-visual fusion and track management. In this\npaper, we conduct a comprehensive overview of audio-visual speaker tracking. To\nour knowledge, this is the first extensive survey over the past five years. We\nintroduce the family of Bayesian filters and summarize the methods for\nobtaining audio-visual measurements. 
In addition, the existing trackers and\ntheir performance on AV16.3 dataset are summarized. In the past few years, deep\nlearning techniques have thrived, which also boosts the development of audio\nvisual speaker tracking. The influence of deep learning techniques in terms of\nmeasurement extraction and state estimation is also discussed. At last, we\ndiscuss the connections between audio-visual speaker tracking and other areas\nsuch as speech separation and distributed speaker tracking.\n","authors":["Jinzheng Zhao","Yong Xu","Xinyuan Qian","Davide Berghi","Peipei Wu","Meng Cui","Jianyuan Sun","Philip J. B. Jackson","Wenwu Wang"],"pdf_url":"https://arxiv.org/pdf/2310.14778v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.14518v1","updated":"2024-12-19T04:33:22Z","published":"2024-12-19T04:33:22Z","title":"Efficient Self-Supervised Video Hashing with Selective State Spaces","summary":" Self-supervised video hashing (SSVH) is a practical task in video indexing\nand retrieval. Although Transformers are predominant in SSVH for their\nimpressive temporal modeling capabilities, they often suffer from computational\nand memory inefficiencies. Drawing inspiration from Mamba, an advanced\nstate-space model, we explore its potential in SSVH to achieve a better balance\nbetween efficacy and efficiency. We introduce S5VH, a Mamba-based video hashing\nmodel with an improved self-supervised learning paradigm. Specifically, we\ndesign bidirectional Mamba layers for both the encoder and decoder, which are\neffective and efficient in capturing temporal relationships thanks to the\ndata-dependent selective scanning mechanism with linear complexity. In our\nlearning strategy, we transform global semantics in the feature space into\nsemantically consistent and discriminative hash centers, followed by a center\nalignment loss as a global learning signal. Our self-local-global (SLG)\nparadigm significantly improves learning efficiency, leading to faster and\nbetter convergence. 
Extensive experiments demonstrate S5VH's improvements over\nstate-of-the-art methods, superior transferability, and scalable advantages in\ninference efficiency. Code is available at\nhttps://github.com/gimpong/AAAI25-S5VH.\n","authors":["Jinpeng Wang","Niu Lian","Jun Li","Yuting Wang","Yan Feng","Bin Chen","Yongbing Zhang","Shu-Tao Xia"],"pdf_url":"https://arxiv.org/pdf/2412.14518v1.pdf","comment":"Accepted by AAAI'25. 9 pages, 5 figures, 2 tables"},{"id":"http://arxiv.org/abs/2412.13609v2","updated":"2024-12-19T03:12:19Z","published":"2024-12-18T08:36:35Z","title":"Sign-IDD: Iconicity Disentangled Diffusion for Sign Language Production","summary":" Sign Language Production (SLP) aims to generate semantically consistent sign\nvideos from textual statements, where the conversion from textual glosses to\nsign poses (G2P) is a crucial step. Existing G2P methods typically treat sign\nposes as discrete three-dimensional coordinates and directly fit them, which\noverlooks the relative positional relationships among joints. To this end, we\nprovide a new perspective, constraining joint associations and gesture details\nby modeling the limb bones to improve the accuracy and naturalness of the\ngenerated poses. In this work, we propose a pioneering iconicity disentangled\ndiffusion framework, termed Sign-IDD, specifically designed for SLP. Sign-IDD\nincorporates a novel Iconicity Disentanglement (ID) module to bridge the gap\nbetween relative positions among joints. The ID module disentangles the\nconventional 3D joint representation into a 4D bone representation, comprising\nthe 3D spatial direction vector and 1D spatial distance vector between adjacent\njoints. 
Additionally, an Attribute Controllable Diffusion (ACD) module is\nintroduced to further constrain joint associations, in which the attribute\nseparation layer aims to separate the bone direction and length attributes, and\nthe attribute control layer is designed to guide the pose generation by\nleveraging the above attributes. The ACD module utilizes the gloss embeddings\nas semantic conditions and finally generates sign poses from noise embeddings.\nExtensive experiments on PHOENIX14T and USTC-CSL datasets validate the\neffectiveness of our method. The code is available at:\nhttps://github.com/NaVi-start/Sign-IDD.\n","authors":["Shengeng Tang","Jiayi He","Dan Guo","Yanyan Wei","Feng Li","Richang Hong"],"pdf_url":"https://arxiv.org/pdf/2412.13609v2.pdf","comment":"Accepted by AAAI 2025"},{"id":"http://arxiv.org/abs/2412.17847v1","updated":"2024-12-19T01:30:19Z","published":"2024-12-19T01:30:19Z","title":"Bridging the Data Provenance Gap Across Text, Speech and Video","summary":" Progress in AI is driven largely by the scale and quality of training data.\nDespite this, there is a deficit of empirical analysis examining the attributes\nof well-established datasets beyond text. In this work we conduct the largest\nand first-of-its-kind longitudinal audit across modalities--popular text,\nspeech, and video datasets--from their detailed sourcing trends and use\nrestrictions to their geographical and linguistic representation. Our manual\nanalysis covers nearly 4000 public datasets between 1990-2024, spanning 608\nlanguages, 798 sources, 659 organizations, and 67 countries. We find that\nmultimodal machine learning applications have overwhelmingly turned to\nweb-crawled, synthetic, and social media platforms, such as YouTube, for their\ntraining sets, eclipsing all other sources since 2019. 
Secondly, tracing the\nchain of dataset derivations we find that while less than 33% of datasets are\nrestrictively licensed, over 80% of the source content in widely-used text,\nspeech, and video datasets, carry non-commercial restrictions. Finally, counter\nto the rising number of languages and geographies represented in public AI\ntraining datasets, our audit demonstrates measures of relative geographical and\nmultilingual representation have failed to significantly improve their coverage\nsince 2013. We believe the breadth of our audit enables us to empirically\nexamine trends in data sourcing, restrictions, and Western-centricity at an\necosystem-level, and that visibility into these questions are essential to\nprogress in responsible AI. As a contribution to ongoing improvements in\ndataset transparency and responsible use, we release our entire multimodal\naudit, allowing practitioners to trace data provenance across text, speech, and\nvideo.\n","authors":["Shayne Longpre","Nikhil Singh","Manuel Cherep","Kushagra Tiwary","Joanna Materzynska","William Brannon","Robert Mahari","Manan Dey","Mohammed Hamdy","Nayan Saxena","Ahmad Mustafa Anis","Emad A. Alghamdi","Vu Minh Chien","Naana Obeng-Marnu","Da Yin","Kun Qian","Yizhi Li","Minnie Liang","An Dinh","Shrestha Mohanty","Deividas Mataciunas","Tobin South","Jianguo Zhang","Ariel N. Lee","Campbell S. Lund","Christopher Klamm","Damien Sileo","Diganta Misra","Enrico Shippole","Kevin Klyman","Lester JV Miranda","Niklas Muennighoff","Seonghyeon Ye","Seungone Kim","Vipul Gupta","Vivek Sharma","Xuhui Zhou","Caiming Xiong","Luis Villa","Stella Biderman","Alex Pentland","Sara Hooker","Jad Kabbara"],"pdf_url":"https://arxiv.org/pdf/2412.17847v1.pdf","comment":"10 pages, 5 figures (main paper)"}]}}
\ No newline at end of file
diff --git a/favicon.ico b/favicon.ico
new file mode 100644
index 00000000..7f5166c7
Binary files /dev/null and b/favicon.ico differ
diff --git a/index.css b/index.css
new file mode 100644
index 00000000..9ded9d94
--- /dev/null
+++ b/index.css
@@ -0,0 +1,355 @@
+:root {
+ /* Palette: Nord (https://www.nordtheme.com) */
+ --nord00: #2e3440;
+ --nord01: #3b4252;
+ --nord02: #434c5e;
+ --nord03: #4c566a;
+ --nord04: #d8dee9;
+ --nord05: #e5e9f0;
+ --nord06: #eceff4;
+ --nord07: #8fbcbb;
+ --nord08: #88c0d0;
+ --nord09: #81a1c1;
+ --nord0A: #5e81ac;
+ --nord0B: #bf616a;
+ --nord0C: #d08770;
+ --nord0D: #ebcb8b;
+ --nord0E: #a3be8c;
+ --nord0F: #b48ead;
+
+
+ /* Typography */
+ --font-family-default: -apple-system, BlinkMacSystemFont, "Segoe UI", Roboto, Oxygen-Sans, Ubuntu, Cantarell, "Helvetica Neue",
+ sans-serif;
+ --font-size-scaler: 62.5%;
+ --font-size-m: 1.6rem;
+ --font-size-s: 1.4rem;
+
+ /* Components */
+ --body-color: var(--nord06);
+ --body-bg: var(--nord00);
+
+ --header-title: var(--nord06);
+ --header-container: var(--nord00);
+ --header-title-preffix: var(--nord0F);
+
+ --chip-font: var(--nord08);
+ --chip-color: var(--nord0B);
+
+ --icons: var(--nord06);
+ --icons-hover: var(--nord0F);
+
+ --day-container: var(--nord01);
+ --date: var(--nord09);
+
+ --summary: var(--nord0E);
+ --summary-hover: var(--nord0F);
+
+ --details-open: var(--nord02);
+ --details-content: var(--nord05);
+ --details-a: var(--nord07);
+ --details-a-hover: var(--nord0F);
+
+ --highlight-title: var(--nord0B);
+ --highlight-author: var(--nord0B);
+
+ --article-summary-hover-color: var(--nord0D);
+ --article-summary-color: var(--nord04);
+
+ --article-title-color: var(--nord05);
+ --article-title-hover-color: var(--nord0E);
+
+ --accordion-content-rail-color: var(--nord01);
+ --accordion-content-hover-rail-color: var(--nord0D);
+ --accordion-title-marker-color: var(--nord01);
+ --accordion-title-hover-marker-color: var(--nord0E);
+
+ --footer-color: var(--nord04);
+ --footer-link-hover-color: var(--nord0D);
+}
+
+[data-theme="light"] {
+ /* Theme design */
+
+ --color-primary: var(--nord07);
+ --color-primary-second: var(--nord00);
+ --color-info: var(--nord0A);
+ --color-success: var(--nord0E);
+ --color-warning: var(--nord0C);
+ --color-danger: var(--nord0B);
+
+ --color-text: var(--nord00);
+ --color-hover: var(--nord0D);
+ --color-shadow: var(--nord03);
+
+ --color-primary-h: var(--nord09);
+ --color-primary-s: var(--nord08);
+ --color-primary-l: var(--nord07);
+
+ --color-contrast-higher-h: var(--nord01);
+ --color-contrast-higher-l: var(--nord02);
+ --color-contrast-higher-s: var(--nord03);
+
+ --color-content: white;
+
+ --background: var(--nord06);
+ --background-content: var(--nord05);
+ --background-color: var(--nord04);
+
+ /* Components */
+
+ --chip-font: var(--nord06);
+ --chip-color: var(--nord09);
+
+ --body-color: var(--background-color);
+ --body-bg: var(--background);
+
+ --header-title: var(--color-shadow);
+ --header-container: var(--background);
+ --header-title-preffix: var(--color-primary-h);
+
+ --icons: var(--color-shadow);
+ --icons-hover: var(--color-hover);
+
+ --day-container: var(--background-content);
+ --date: var(--color-primary-l);
+
+ --summary: var(--color-info);
+ --summary-hover: var(--color-success);
+
+ --details-open: var(--color-content);
+ --details-content: var(--color-text);
+ --details-a: var(--color-primary-h);
+ --details-a-hover: var(--color-hover);
+
+ --highlight-title: var(--color-danger);
+ --highlight-author: var(--color-warning);
+
+ --article-summary-color: var(--color-text);
+ --article-summary-hover-color: var(--color-primary-s);
+
+ --article-title-color: var(--color-primary);
+ --article-title-hover-color: var(--color-success);
+
+ --accordion-content-rail-color: var(--color-warning);
+ --accordion-content-hover-rail-color: var(--color-warning);
+ --accordion-title-marker-color: var(--color-success);
+ --accordion-title-hover-marker-color: var(--color-success);
+
+ --footer-color: var(--color-text);
+ --footer-link-hover-color: var(--color-hover);
+}
+
+html {
+ font-size: var(--font-size-scaler);
+}
+
+body {
+ background-color: var(--body-bg);
+ font-family: var(--font-family-default);
+ color: var(--body-color);
+ margin: 0;
+ padding-top: 16px;
+ display: grid;
+}
+
+.header-container {
+ width: 90%;
+ max-width: 1200px;
+ background: var(--header-container);
+ margin: 0 auto;
+}
+
+.header-title {
+ font-size: 32px;
+ font-weight: bold;
+ color: var(--header-title);
+ margin: 0;
+ padding-bottom: 14px;
+}
+
+.header-title-preffix {
+ color: var(--header-title-preffix);
+}
+
+.icons {
+ color: var(--icons);
+ padding-bottom: 16px;
+}
+
+.icons a {
+ color: var(--icons);
+ text-decoration: none;
+}
+
+.icons a:hover {
+ color: var(--icons-hover);
+}
+
+.day-container {
+ padding: 16px;
+ background: var(--day-container);
+ width: 90%;
+ max-width: 1200px;
+ margin: 0 auto;
+ margin-bottom: 8px;
+ border-radius: 10px;
+}
+
+.date {
+ font-size: 24px;
+ font-weight: 700;
+ margin: 0;
+ color: var(--date);
+}
+
+p {
+ margin: 0;
+}
+
+summary {
+ font-weight: 600;
+ color: var(--summary);
+}
+
+summary:hover {
+ text-decoration: underline;
+ cursor: pointer;
+ color: var(--summary-hover);
+}
+
+details {
+ --border-color: transparent;
+
+ padding: 2px 4px;
+ font-size: 20px;
+ border: 1px solid var(--border-color);
+ border-radius: 4px;
+}
+
+details[open] {
+ background-color: var(--details-open);
+ margin-bottom: 8px;
+}
+
+.details-content {
+ padding: 12px 3px;
+ gap: 16px;
+ color: var(--details-content);
+}
+
+details a {
+ color: var(--details-a);
+}
+
+details a:hover {
+ color: var(--details-a-hover);
+}
+
+footer {
+ margin: 0 auto;
+ color: var(--footer-color);
+ font-size: var(--font-size-s);
+ display: flex;
+ padding: 0 16px;
+ justify-content: space-between;
+}
+
+.description {
+ margin: 0 auto;
+ color: var(--footer-color);
+ font-size: var(--font-size-s);
+ display: flex;
+ padding: 0 16px;
+ text-align: center;
+}
+
+.highlight-author {
+ color: var(--highlight-author);
+ font-weight: bold;
+}
+
+.highlight-title {
+ color: var(--highlight-title);
+ font-weight: bold;
+}
+
+.channel-description {
+ text-align: center;
+ font-size: var(--font-size-scaler);
+}
+
+.article-summary-link {
+ color: var(--article-summary-color);
+ font-size: var(--font-size-s);
+ text-decoration: none;
+}
+
+.article-summary-link:hover {
+ color: var(--article-summary-hover-color);
+ --accordion-content-rail-color: var(--accordion-content-hover-rail-color);
+}
+
+.article-summary-box-outer {
+ display: block;
+ padding: 4px 8px 8px 4px;
+}
+
+.article-summary-box-inner {
+ padding-left: 8px;
+ border-left: 1px solid var(--accordion-content-rail-color);
+ font-size: var(--font-size-m);
+}
+
+.article-expander {
+ padding: 10px 4px;
+ border-radius: 4px;
+}
+
+.article-authors {
+ font-size: var(--font-size-m);
+ padding: 0.25em 1em;
+}
+
+.article-authors a {
+ text-decoration: none;
+}
+
+.article-expander-title {
+ font-size: var(--font-size-m);
+ font-weight: 600;
+}
+
+.article-expander-title:hover {
+ cursor: pointer;
+}
+
+.article-expander-title::marker {
+ color: var(--accordion-title-marker-color);
+}
+
+.article-expander-title:hover::marker {
+ color: var(--accordion-title-hover-marker-color);
+}
+
+/* for switcher */
+.theme-switch {
+ display: inline-block;
+ position: relative;
+}
+
+.theme-switch input {
+ display: none;
+}
+
+/* chip */
+.chip {
+ font-size: 90%;
+ align-items: center;
+ color: var(--chip-font);
+ background: var(--chip-color);
+ border-radius: 5rem;
+ display: inline-flex;
+ padding: .2rem .4rem;
+ vertical-align: middle;
+}
\ No newline at end of file
diff --git a/index.html b/index.html
new file mode 100644
index 00000000..a56c4645
--- /dev/null
+++ b/index.html
@@ -0,0 +1,23318 @@
+
+
+
+
+ MyArxiv
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ MyArxiv
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Computation and Language 86
+
+
+
+
+
+ ☆ Long-Form Speech Generation with Spoken Language Models
+
+
+
+
+
+
+
+
+ Se Jin Park, Julian Salazar, Aren Jansen, Keisuke Kinoshita, Yong Man Ro, RJ Skerry-Ryan
+
+
+ We consider the generative modeling of speech over multiple minutes, a
+requirement for long-form multimedia generation and audio-native voice
+assistants. However, current spoken language models struggle to generate
+plausible speech past tens of seconds, from high temporal resolution of speech
+tokens causing loss of coherence, to architectural issues with long-sequence
+training or extrapolation, to memory costs at inference time. With these
+considerations we propose SpeechSSM, the first speech language model to learn
+from and sample long-form spoken audio (e.g., 16 minutes of read or
+extemporaneous speech) in a single decoding session without text intermediates,
+based on recent advances in linear-time sequence modeling. Furthermore, to
+address growing challenges in spoken language evaluation, especially in this
+new long-form setting, we propose: new embedding-based and LLM-judged metrics;
+quality measurements over length and time; and a new benchmark for long-form
+speech processing and generation, LibriSpeech-Long. Speech samples and the
+dataset are released at
+https://google.github.io/tacotron/publications/speechssm/
+
+
+
+
+
+
+
+ ☆ Exploring Embedding Priors in Prompt-Tuning for Improved
+ Interpretability and Control
+
+
+ Prompt-Tuning is an efficient method for adapting pre-trained language models
+to new tasks with minimal computational overhead by modifying prompt
+embeddings. In this work, we investigate how crucial the phenomenon of
+embedding collapse, frequently observed in Prompt-Tuning, is for the final
+performance of the model. To address this question, we designed embedding
+priors and compared them with posteriors of the converged Soft and Deep
+Prompt-Tuning methods. Our findings suggest that priors strongly affect the
+position of the tuned embeddings, and models can effectively work with
+embeddings from different parts of activation spaces, including completely new
+regions. As the final Prompt-Tuning capabilities are limited, we hypothesize
+that controllable Prompt-Tuning posteriors may serve as a good starting point
+for tasks such as chain-of-thought (COT) distillation. Our experiments also
+show that generated trajectories are not localized in the activation space of
+the models. However, there are distinct clusters of activations for distant
+tasks (e.g., NLP and arithmetic), while activations between NLP tasks (e.g.,
+Question-Answering and MLM) lie in the same cluster. These observations raise
+questions about the importance of a single activation cluster for the
+generalization abilities of large language models.
+
+
+
+
+
+
+
+ ☆ How Well Do LLMs Generate Code for Different Application Domains?
+ Benchmark and Evaluation
+
+
+ Recently, an increasing number of AI-driven programming assistants powered by
+code LLMs have been integrated into various real-world software development
+environments, significantly boosting developer productivity. However, existing
+code generation benchmarks primarily focus on general-purpose scenarios,
+leaving the code generation performance of LLMs for specific application
+domains largely unknown. In this paper, we introduce a new benchmark,
+MultiCodeBench, to fill this gap. MultiCodeBench comprises 2,400 programming
+tasks, covering 12 popular software development domains and 15 programming
+languages. Specifically, we perform in-depth research to identify these 12
+application domains. Given that each domain may involve multiple technical
+frameworks, and that different frameworks present distinct challenges in the
+coding process, we categorize the commonly used frameworks and platforms within
+each domain. We then sample programming problems from GitHub repositories
+related to these subdomains. To ensure the quality of the tasks and mitigate
+data leakage issues, we invite annotators to rewrite the docstrings for each
+task in MultiCodeBench. Additionally, we build a static analysis-based
+dependency parsing tool to extract the dependencies in the ground truth for
+each task, enabling deeper performance analysis. Through extensive experiments
+on MultiCodeBench with eleven representative mainstream LLMs, we reveal the
+code generation performance of the LLMs across different application domains,
+providing practical insights for developers in downstream fields when selecting
+LLMs. Furthermore, we analyze the reasons behind the models' failures in
+completing software application development tasks, offering guidance for model
+developers to enhance domain-specific code generation capabilities.
+
+
+
+
+
+
+
+ ☆ Zero-resource Speech Translation and Recognition with LLMs ICASSP 2025
+
+
+
+
+
+
+
+
+ Karel Mundnich, Xing Niu, Prashant Mathur, Srikanth Ronanki, Brady Houston, Veera Raghavendra Elluru, Nilaksh Das, Zejiang Hou, Goeric Huybrechts, Anshu Bhatia, Daniel Garcia-Romero, Kyu J. Han, Katrin Kirchhoff
+
+
+ Despite recent advancements in speech processing, zero-resource speech
+translation (ST) and automatic speech recognition (ASR) remain challenging
+problems. In this work, we propose to leverage a multilingual Large Language
+Model (LLM) to perform ST and ASR in languages for which the model has never
+seen paired audio-text data. We achieve this by using a pre-trained
+multilingual speech encoder, a multilingual LLM, and a lightweight adaptation
+module that maps the audio representations to the token embedding space of the
+LLM. We perform several experiments both in ST and ASR to understand how to
+best train the model and what data has the most impact on performance in
+previously unseen languages. In ST, our best model is able to achieve BLEU
+scores over 23 in CoVoST2 for two previously unseen languages, while in ASR, we
+achieve WERs of up to 28.2%. We finally show that the performance of our
+system is bounded by the ability of the LLM to output text in the desired
+language.
+
+
+ Fine-grained sentiment analysis (FSA) aims to extract and summarize user
+opinions from vast opinionated text. Recent studies demonstrate that large
+language models (LLMs) possess exceptional sentiment understanding
+capabilities. However, directly deploying LLMs for FSA applications incurs high
+inference costs. Therefore, this paper investigates the distillation of
+fine-grained sentiment understanding from LLMs into small language models
+(SLMs). We prompt LLMs to examine and interpret the sentiments of given reviews
+and then utilize the generated content to pretrain SLMs. Additionally, we
+develop a comprehensive FSA benchmark to evaluate both SLMs and LLMs. Extensive
+experiments on this benchmark reveal that: (1) distillation significantly
+enhances the performance of SLMs in FSA tasks, achieving a 6.00% improvement
+in $F_1$-score, and the distilled model can outperform Llama-2-7b with only
+220M parameters; (2) distillation equips SLMs with excellent zero-shot
+sentiment classification capabilities, enabling them to match or even exceed
+their teacher models. These results suggest that distillation from LLMs is a
+highly promising direction for FSA. We will release our code, data, and
+pretrained model weights at
+https://github.com/HITSZ-HLT/FSA-Distillation.
+
+
+
+
+
+
+
+ ☆ Libra-Leaderboard: Towards Responsible AI through a Balanced Leaderboard
+ of Safety and Capability
+
+
+ To address this gap, we introduce Libra-Leaderboard, a comprehensive
+framework designed to rank LLMs through a balanced evaluation of performance
+and safety. Combining a dynamic leaderboard with an interactive LLM arena,
+Libra-Leaderboard encourages the joint optimization of capability and safety.
+Unlike traditional approaches that average performance and safety metrics,
+Libra-Leaderboard uses a distance-to-optimal-score method to calculate the
+overall rankings. This approach incentivizes models to achieve a balance rather
+than excelling in one dimension at the expense of some other ones. In the first
+release, Libra-Leaderboard evaluates 26 mainstream LLMs from 14 leading
+organizations, identifying critical safety challenges even in state-of-the-art
+models.
+
+
+ Reasoning is critical for large language models (LLMs) to excel in a wide
+range of tasks. While methods like Chain-of-Thought (CoT) reasoning enhance LLM
+performance by decomposing problems into intermediate steps, they also incur
+significant overhead in token usage, leading to increased costs. We find that
+the reasoning process of current LLMs is unnecessarily lengthy and it can be
+compressed by including a reasonable token budget in the prompt, but the choice
+of token budget plays a crucial role in the actual compression effectiveness.
+We then propose a token-budget-aware LLM reasoning framework, which dynamically
+estimates token budgets for different problems based on reasoning complexity
+and uses the estimated token budgets to guide the reasoning process.
+Experiments show that our method effectively reduces token costs in CoT
+reasoning with only a slight performance reduction, offering a practical
+solution to balance efficiency and accuracy in LLM reasoning. Code:
+https://github.com/GeniusHTX/TALE.
+
+
+
+
+
+
+
+ ☆ Consistency Checks for Language Model Forecasters ICLR 2025
+
+
+ Forecasting is a task that is difficult to evaluate: the ground truth can
+only be known in the future. Recent work showing LLM forecasters rapidly
+approaching human-level performance begs the question: how can we benchmark and
+evaluate these forecasters instantaneously? Following the consistency check
+framework, we measure the performance of forecasters in terms of the
+consistency of their predictions on different logically-related questions. We
+propose a new, general consistency metric based on arbitrage: for example, if a
+forecasting AI illogically predicts that both the Democratic and Republican
+parties have 60% probability of winning the 2024 US presidential election, an
+arbitrageur can trade against the forecaster's predictions and make a profit.
+We build an automated evaluation system that generates a set of base questions,
+instantiates consistency checks from these questions, elicits the predictions
+of the forecaster, and measures the consistency of the predictions. We then
+build a standard, proper-scoring-rule forecasting benchmark, and show that our
+(instantaneous) consistency metrics correlate with LLM forecasters' ground
+truth Brier scores (which are only known in the future). We also release a
+consistency benchmark that resolves in 2028, providing a long-term evaluation
+tool for forecasting.
+
+
+ Large Language Models (LLMs) demonstrate remarkable capabilities, yet
+struggle with hallucination and outdated knowledge when tasked with complex
+knowledge reasoning, resulting in factually incorrect outputs. Previous studies
+have attempted to mitigate it by retrieving factual knowledge from large-scale
+knowledge graphs (KGs) to assist LLMs in logical reasoning and prediction of
+answers. However, this kind of approach often introduces noise and irrelevant
+data, especially in situations with extensive context from multiple knowledge
+aspects. In this way, LLM attention can potentially be misled from the question
+and relevant information. In our study, we introduce an Adaptive Multi-Aspect
+Retrieval-augmented over KGs (Amar) framework. This method retrieves knowledge
+including entities, relations, and subgraphs, and converts each piece of
+retrieved text into prompt embeddings. The Amar framework comprises two key
+sub-components: 1) a self-alignment module that aligns commonalities among
+entities, relations, and subgraphs to enhance retrieved text, thereby reducing
+noise interference; 2) a relevance gating module that employs a soft gate to
+learn the relevance score between question and multi-aspect retrieved data, to
+determine which information should be used to enhance LLMs' output, or even
+filtered altogether. Our method has achieved state-of-the-art performance on
+two common datasets, WebQSP and CWQ, showing a 1.9% improvement in accuracy
+over its best competitor and a 6.6% improvement in logical form generation
+over a method that directly uses retrieved text as context prompts. These
+results demonstrate the effectiveness of Amar in improving the reasoning of
+LLMs.
+
+
+
+ comment: Accepted by AAAI'2025
+
+
+
+
+
+
+ ☆ Characterizations of Language Generation With Breadth
+
+
+ We study language generation in the limit, introduced by Kleinberg and
+Mullainathan [KM24], building on classical works of Gold [Gol67] and Angluin
+[Ang79]. [KM24] proposed an algorithm that generates strings from any countable
+language collection in the limit. While their algorithm eventually outputs
+strings from the target language $K$, it sacrifices breadth, i.e., the ability
+to generate all strings in $K$. A key open question in [KM24] is whether this
+trade-off between consistency and breadth is inherent.
+ Recent works proposed different notions of consistent generation with
+breadth. Kalavasis, Mehrotra, and Velegkas [KVM24] introduced three
+definitions: generation with exact breadth, approximate breadth, and
+unambiguous generation. Concurrently and independently, Charikar and Pabbaraju
+[CP24a] proposed exhaustive generation. Both works examined when generation
+with these notions of breadth is possible.
+ Building on [CP24a, KVM24], we fully characterize language generation for
+these notions and their natural combinations. For exact breadth, we provide an
+unconditional lower bound, removing a technical condition from [KVM24] and
+extending the result of [CP24a] that holds for specific collections of
+languages. We show that generation with exact breadth is characterized by
+Angluin's condition for identification. We further introduce a weaker version
+of Angluin's condition that tightly characterizes both approximate breadth and
+exhaustive generation, proving their equivalence. Additionally, we show that
+unambiguous generation is also characterized by Angluin's condition as a
+special case of a broader result. Finally, we strengthen [KVM24] by giving
+unconditional lower bounds for stable generators, showing that Angluin's
+condition characterizes the previous breadth notions for stable generators.
+This shows a separation between stable and unstable generation with approximate
+breadth.
+
+
+
+ comment: Abstract shortened to fix arXiv limit
+
+
+
+
+
+
+ ☆ Think or Remember? Detecting and Directing LLMs Towards Memorization or
+ Generalization
+
+
+ In this paper, we explore the foundational mechanisms of memorization and
+generalization in Large Language Models (LLMs), inspired by the functional
+specialization observed in the human brain. Our investigation serves as a case
+study leveraging specially designed datasets and experimental-scale LLMs to lay
+the groundwork for understanding these behaviors. Specifically, we aim to first
+enable LLMs to exhibit both memorization and generalization by training with
+the designed dataset, then (a) examine whether LLMs exhibit neuron-level
+spatial differentiation for memorization and generalization, (b) predict these
+behaviors using model internal representations, and (c) steer the behaviors
+through inference-time interventions. Our findings reveal that neuron-wise
+differentiation of memorization and generalization is observable in LLMs, and
+targeted interventions can successfully direct their behavior.
+
+
+
+
+
+
+
+ ☆ Generating event descriptions under syntactic and semantic constraints
+
+
+
+
+
+
+
+
+ Angela Cao, Faye Holt, Jonas Chan, Stephanie Richter, Lelia Glass, Aaron Steven White
+
+
+ With the goal of supporting scalable lexical semantic annotation, analysis,
+and theorizing, we conduct a comprehensive evaluation of different methods for
+generating event descriptions under both syntactic constraints -- e.g. desired
+clause structure -- and semantic constraints -- e.g. desired verb sense. We
+compare three different methods -- (i) manual generation by experts; (ii)
+sampling from a corpus annotated for syntactic and semantic information; and
+(iii) sampling from a language model (LM) conditioned on syntactic and semantic
+information -- along three dimensions of the generated event descriptions: (a)
+naturalness, (b) typicality, and (c) distinctiveness. We find that all methods
+reliably produce natural, typical, and distinctive event descriptions, but that
+manual generation continues to produce event descriptions that are more
+natural, typical, and distinctive than the automated generation methods. We
+conclude that the automated methods we consider produce event descriptions of
+sufficient quality for use in downstream annotation and analysis insofar as the
+methods used for this annotation and analysis are robust to a small amount of
+degradation in the resulting event descriptions.
+
+
+
+
+
+
+
+ ☆ How "Real" is Your Real-Time Simultaneous Speech-to-Text Translation
+ System? ACL
+
+
+
+
+
+
+
+
+ Sara Papi, Peter Polak, Ondřej Bojar, Dominik Macháček
+
+
+ Simultaneous speech-to-text translation (SimulST) translates source-language
+speech into target-language text concurrently with the speaker's speech,
+ensuring low latency for better user comprehension. Despite its intended
+application to unbounded speech, most research has focused on human
+pre-segmented speech, simplifying the task and overlooking significant
+challenges. This narrow focus, coupled with widespread terminological
+inconsistencies, is limiting the applicability of research outcomes to
+real-world applications, ultimately hindering progress in the field. Our
+extensive literature review of 110 papers not only reveals these critical
+issues in current research but also serves as the foundation for our key
+contributions. We 1) define the steps and core components of a SimulST system,
+proposing a standardized terminology and taxonomy; 2) conduct a thorough
+analysis of community trends, and 3) offer concrete recommendations and future
+directions to bridge the gaps in existing literature, from evaluation
+frameworks to system architectures, for advancing the field towards more
+realistic and effective SimulST solutions.
+
+
+
+
+
+
+
+
+ Shahar Katz, Liran Ringel, Yaniv Romano, Lior Wolf
+
+
+ Modern Language Models (LMs) owe much of their success to masked causal
+attention, the backbone of Generative Pre-Trained Transformer (GPT) models.
+Although GPTs can process the entire user prompt at once, the causal masking is
+applied to all input tokens step-by-step, mimicking the generation process.
+This imposes an unnecessary constraint during the initial "prefill" phase when
+the model processes the input prompt and generates the internal representations
+before producing any output tokens. In this work, attention is masked based on
+the known block structure at the prefill phase, followed by the conventional
+token-by-token autoregressive process after that. For example, in a typical
+chat prompt, the system prompt is treated as one block, and the user prompt as
+the next one. Each of these is treated as a unit for the purpose of masking,
+such that the first tokens in each block can access the subsequent tokens in a
+non-causal manner. Then, the model answer is generated in the conventional
+causal manner. This Segment-by-Segment scheme entails no additional
+computational overhead. When integrating it into models such as Llama and Qwen,
+state-of-the-art performance is consistently achieved.
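The block-wise masking described above can be illustrated with a small sketch. It is not the authors' implementation: the function name `block_prefill_mask` and the convention of labelling each token with a non-decreasing block id are assumptions made here for illustration. A token may attend to every token in its own block and in earlier blocks; giving each generated token its own block id falls back to the usual causal mask.

```python
import numpy as np

def block_prefill_mask(block_ids):
    """Boolean attention mask: entry [i, j] is True when token i may attend to token j.

    block_ids is assumed to be a non-decreasing list of block labels, one per token
    (a hypothetical convention for this sketch).
    """
    blocks = np.asarray(block_ids)
    # Token i sees token j if j's block is the same as, or earlier than, i's block.
    return blocks[None, :] <= blocks[:, None]

# Two prefill blocks: a 2-token system prompt and a 3-token user prompt.
mask = block_prefill_mask([0, 0, 1, 1, 1])
```

With one block id per token, for example `block_prefill_mask([0, 1, 2])`, the result equals the lower-triangular causal mask.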
+
+
+
+
+
+
+
+ ☆ Is Large Language Model Good at Triple Set Prediction? An Empirical
+ Study
+
+
+ The core of the Knowledge Graph Completion (KGC) task is to predict and
+complete the missing relations or nodes in a KG. Common KGC tasks are mostly
+about inferring unknown elements with one or two elements being known in a
+triple. In comparison, the Triple Set Prediction (TSP) task is a more realistic
+knowledge graph completion task. It aims to predict all elements of unknown
+triples based on the information from known triples. In recent years, large
+language models (LLMs) have exhibited significant advancements in language
+comprehension, demonstrating considerable potential for KGC tasks. However, the
+potential of LLM on the TSP task has yet to be investigated. Thus, in this
+paper we propose a new framework to explore the strengths and limitations of
+LLM in the TSP task. Specifically, the framework consists of LLM-based rule
+mining and LLM-based triple set prediction. The relation list of KG embedded
+within rich semantic information is first leveraged to prompt LLM in the
+generation of rules. This process is both efficient and independent of
+statistical information, making it easier to mine effective and realistic
+rules. For each subgraph, the specified rule is applied in conjunction with the
+relevant triples within that subgraph to guide the LLM in predicting the
+missing triples. Subsequently, the predictions from all subgraphs are
+consolidated to derive the complete set of predicted triples on KG. Finally,
+the method is evaluated on the relatively complete CFamily dataset. The
+experimental results indicate that when LLMs are required to adhere to a large
+amount of factual knowledge to predict missing triples, significant
+hallucinations occur, leading to a noticeable decline in performance. To
+further explore the causes of this phenomenon, this paper presents a
+comprehensive analysis supported by a detailed case study.
+
+
+
+
+
+
+
+ ☆ Unlocking the Potential of Multiple BERT Models for Bangla Question
+ Answering in NCTB Textbooks
+
+
+
+
+
+
+
+
+ Abdullah Khondoker, Enam Ahmed Taufik, Md Iftekhar Islam Tashik, S M Ishtiak mahmud, Antara Firoz Parsa
+
+
+ Evaluating text comprehension in educational settings is critical for
+understanding student performance and improving curricular effectiveness. This
+study investigates the capability of state-of-the-art language models -- RoBERTa
+Base, Bangla-BERT, and BERT Base -- in automatically assessing Bangla
+passage-based question-answering from the National Curriculum and Textbook
+Board (NCTB) textbooks for classes 6-10. A dataset of approximately 3,000
+Bangla passage-based question-answering instances was compiled, and the models
+were evaluated using F1 Score and Exact Match (EM) metrics across various
+hyperparameter configurations. Our findings revealed that Bangla-BERT
+consistently outperformed the other models, achieving the highest F1 (0.75) and
+EM (0.53) scores, particularly with smaller batch sizes, the inclusion of stop
+words, and a moderate learning rate. In contrast, RoBERTa Base demonstrated the
+weakest performance, with the lowest F1 (0.19) and EM (0.27) scores under
+certain configurations. The results underscore the importance of fine-tuning
+hyperparameters for optimizing model performance and highlight the potential of
+machine learning models in evaluating text comprehension in educational
+contexts. However, limitations such as dataset size, spelling inconsistencies,
+and computational constraints emphasize the need for further research to
+enhance the robustness and applicability of these models. This study lays the
+groundwork for the future development of automated evaluation systems in
+educational institutions, providing critical insights into model performance in
+the context of Bangla text comprehension.
+
+
+
+
+
+
+
+
+ Zhili Shen, Chenxin Diao, Pavlos Vougiouklis, Pascual Merita, Shriram Piramanayagam, Damien Graux, Dandan Tu, Zeren Jiang, Ruofei Lai, Yang Ren, Jeff Z. Pan
+
+
+ Retrieval-augmented generation systems rely on effective document retrieval
+capabilities. By design, conventional sparse or dense retrievers face
+challenges in multi-hop retrieval scenarios. In this paper, we present GeAR,
+which advances RAG performance through two key innovations: (i) graph
+expansion, which enhances any conventional base retriever, such as BM25, and
+(ii) an agent framework that incorporates graph expansion. Our evaluation
+demonstrates GeAR's superior retrieval performance on three multi-hop question
+answering datasets. Additionally, our system achieves state-of-the-art results
+with improvements exceeding 10% on the challenging MuSiQue dataset, while
+requiring fewer tokens and iterations compared to other multi-step retrieval
+systems.
+
+
+
+
+
+
+
+ ☆ Explainable Multi-Modal Data Exploration in Natural Language via LLM
+ Agent
+
+
+
+
+
+
+
+
+ Farhad Nooralahzadeh, Yi Zhang, Jonathan Furst, Kurt Stockinger
+
+
+ International enterprises, organizations, or hospitals collect large amounts
+of multi-modal data stored in databases, text documents, images, and videos.
+While there has been recent progress in the separate fields of multi-modal data
+exploration as well as in database systems that automatically translate natural
+language questions to database query languages, the research challenge of
+querying database systems combined with other unstructured modalities such as
+images in natural language is widely unexplored.
+ In this paper, we propose XMODE - a system that enables explainable,
+multi-modal data exploration in natural language. Our approach is based on the
+following research contributions: (1) Our system is inspired by a real-world
+use case that enables users to explore multi-modal information systems. (2)
+XMODE leverages an LLM-based agentic AI framework to decompose a natural
+language question into subtasks such as text-to-SQL generation and image
+analysis. (3) Experimental results on multi-modal datasets over relational data
+and images demonstrate that our system outperforms state-of-the-art multi-modal
+exploration systems, excelling not only in accuracy but also in various
+performance metrics such as query latency, API costs, planning efficiency, and
+explanation quality, thanks to the more effective utilization of the reasoning
+capabilities of LLMs.
+
+
+
+
+
+
+
+ ☆ LongDocURL: a Comprehensive Multimodal Long Document Benchmark
+ Integrating Understanding, Reasoning, and Locating
+
+
+
+
+
+
+
+
+ Chao Deng, Jiale Yuan, Pi Bu, Peijie Wang, Zhong-Zhi Li, Jian Xu, Xiao-Hui Li, Yuan Gao, Jun Song, Bo Zheng, Cheng-Lin Liu
+
+
+ Large vision language models (LVLMs) have improved the document understanding
+capabilities remarkably, enabling the handling of complex document elements,
+longer contexts, and a wider range of tasks. However, existing document
+understanding benchmarks have been limited to handling only a small number of
+pages and fail to provide a comprehensive analysis of layout element locating.
+In this paper, we first define three primary task categories: Long Document
+Understanding, numerical Reasoning, and cross-element Locating, and then
+propose a comprehensive benchmark, LongDocURL, integrating the above three primary
+tasks and comprising 20 sub-tasks categorized based on different primary tasks
+and answer evidences. Furthermore, we develop a semi-automated construction
+pipeline and collect 2,325 high-quality question-answering pairs, covering more
+than 33,000 pages of documents, significantly outperforming existing
+benchmarks. Subsequently, we conduct comprehensive evaluation experiments on
+both open-source and closed-source models across 26 different configurations,
+revealing critical performance gaps in this field.
+
+
+
+
+
+
+
+ ☆ Multilingual Mathematical Reasoning: Advancing Open-Source LLMs in Hindi
+ and English AAAI 2025
+
+
+ Large Language Models (LLMs) excel in linguistic tasks but struggle with
+mathematical reasoning, particularly in non-English languages like Hindi. This
+research aims to enhance the mathematical reasoning skills of smaller,
+resource-efficient open-source LLMs in both Hindi and English. We evaluate models like
+OpenHathi 7B, LLaMA-2 7B, WizardMath 7B, Mistral 7B, LLeMMa 7B, MAmmoTH 7B,
+Gemini Pro, and GPT-4 using zero-shot, few-shot chain-of-thought (CoT) methods,
+and supervised fine-tuning. Our approach incorporates curriculum learning,
+progressively training models on increasingly difficult problems, a novel
+Decomposition Strategy to simplify complex arithmetic operations, and a
+Structured Solution Design that divides solutions into phases. Our experiments
+result in notable performance enhancements. WizardMath 7B exceeds Gemini's
+accuracy on English datasets by +6% and matches Gemini's performance on Hindi
+datasets. Adopting a bilingual approach that combines English and Hindi samples
+achieves results comparable to individual language models, demonstrating the
+capability to learn mathematical reasoning in both languages. This research
+highlights the potential for improving mathematical reasoning in open-source
+LLMs.
+
+
+
+ comment: Accepted at AAAI 2025
+
+
+
+
+
+
+ ☆ ChaI-TeA: A Benchmark for Evaluating Autocompletion of Interactions with
+ LLM-based Chatbots
+
+
+ The rise of LLMs has deflected a growing portion of human-computer
+interactions towards LLM-based chatbots. The remarkable abilities of these
+models allow users to interact using long, diverse natural language text
+covering a wide range of topics and styles. Phrasing these messages is a
+time- and effort-consuming task, calling for an autocomplete solution to assist
+users. We introduce the task of chatbot interaction autocomplete. We present
+ChaI-TeA: CHat InTEraction Autocomplete, an autocomplete evaluation framework
+for LLM-based chatbot interactions. The framework includes a formal definition
+of the task, coupled with suitable datasets and metrics. We use the framework
+to test 9 models on the defined autocompletion task, finding that
+while current off-the-shelf models perform fairly, there is still much room for
+improvement, mainly in ranking of the generated suggestions. We provide
+insights for practitioners working on this task and open new research
+directions for researchers in the field. We release our framework to serve as a
+foundation for future research.
+
+
+ This study introduces Bidirectional Topic Matching (BTM), a novel method for
+cross-corpus topic modeling that quantifies thematic overlap and divergence
+between corpora. BTM is a flexible framework that can incorporate various topic
+modeling approaches, including BERTopic, Top2Vec, and Latent Dirichlet
+Allocation (LDA). BTM employs a dual-model approach, training separate topic
+models for each corpus and applying them reciprocally to enable comprehensive
+cross-corpus comparisons. This methodology facilitates the identification of
+shared themes and unique topics, providing nuanced insights into thematic
+relationships. Validation against cosine similarity-based methods demonstrates
+the robustness of BTM, with strong agreement metrics and distinct advantages in
+handling outlier topics. A case study on climate news articles showcases BTM's
+utility, revealing significant thematic overlaps and distinctions between
+corpora focused on climate change and climate action. BTM's flexibility and
+precision make it a valuable tool for diverse applications, from political
+discourse analysis to interdisciplinary studies. By integrating shared and
+unique topic analyses, BTM offers a comprehensive framework for exploring
+thematic relationships, with potential extensions to multilingual and dynamic
+datasets. This work highlights BTM's methodological contributions and its
+capacity to advance discourse analysis across various domains.
+
+
+
+ comment: 12 pages, 4 figures
+
+
+
+
+
+
+ ☆ Towards Global AI Inclusivity: A Large-Scale Multilingual Terminology
+ Dataset
+
+
+ The field of machine translation has achieved significant advancements, yet
+domain-specific terminology translation, particularly in AI, remains
+challenging. We introduced GIST, a large-scale multilingual AI terminology
+dataset containing 5K terms extracted from top AI conference papers spanning
+2000 to 2023. The terms were translated into Arabic, Chinese, French, Japanese,
+and Russian using a hybrid framework that combines LLMs for extraction with
+human expertise for translation. The dataset's quality was benchmarked against
+existing resources, demonstrating superior translation accuracy through
+crowdsourced evaluation. GIST was integrated into translation workflows using
+post-translation refinement methods that required no retraining, where LLM
+prompting consistently improved BLEU and COMET scores. A web demonstration on
+the ACL Anthology platform highlights its practical application, showcasing
+improved accessibility for non-English speakers. This work aims to address
+critical gaps in AI terminology resources and fosters global inclusivity and
+collaboration in AI research.
+
+
+
+
+
+
+
+ ☆ Extracting triples from dialogues for conversational social agents
+
+
+ Obtaining an explicit understanding of communication within a Hybrid
+Intelligence collaboration is essential to create controllable and transparent
+agents. In this paper, we describe a number of Natural Language Understanding
+models that extract explicit symbolic triples from social conversation. Triple
+extraction has mostly been developed and tested for Knowledge Base Completion
+using Wikipedia text and data for training and testing. However, social
+conversation is very different as a genre in which interlocutors exchange
+information in sequences of utterances that involve statements, questions, and
+answers. Phenomena such as co-reference, ellipsis, coordination, and implicit
+and explicit negation or confirmation are more prominent in conversation than
+in Wikipedia text. We therefore describe an attempt to fill this gap by
+releasing data sets for training and testing triple extraction from social
+conversation. We also created five triple extraction models and tested them on
+our evaluation data. The highest precision is 51.14 for complete triples and
+69.32 for triple elements when tested on single utterances. However, scores for
+conversational triples that span multiple turns are much lower, showing that
+extracting knowledge from true conversational data is much more challenging.
+
+
+
+
+
+
+
+ ☆ Multi-Agents Based on Large Language Models for Knowledge-based Visual
+ Question Answering
+
+
+ Large Language Models (LLMs) have achieved impressive results in
+knowledge-based Visual Question Answering (VQA). However, existing methods still
+have challenges: the inability to use external tools autonomously, and the
+inability to work in teams. Humans tend to know whether they need to use
+external tools when they encounter a new question, e.g., they tend to be able
+to give a direct answer to a familiar question, whereas they tend to use tools
+such as search engines when they encounter an unfamiliar question. In addition,
+humans also tend to collaborate and discuss with others to get better answers.
+Inspired by this, we propose the multi-agent voting framework. We design three
+LLM-based agents that simulate different levels of staff in a team, and assign
+the available tools according to the levels. Each agent provides the
+corresponding answer, and finally all the answers provided by the agents are
+voted to get the final answer. Experiments on OK-VQA and A-OKVQA show that our
+approach outperforms other baselines by 2.2 and 1.0, respectively.
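The final voting step described above can be sketched as a plain majority vote. The abstract does not specify how answers are normalized or how ties are broken, so the lowercase/strip normalization and the first-seen tie-break below are assumptions of this sketch, and `majority_vote` is a hypothetical name.

```python
from collections import Counter

def majority_vote(answers):
    """Return the most frequent answer among the agents' answers."""
    # Light normalization so "Paris" and " paris " count as the same answer (an assumption).
    normalized = [answer.strip().lower() for answer in answers]
    counts = Counter(normalized)
    # max() returns the first maximal key in insertion order, so ties go to the earliest answer.
    return max(counts, key=counts.get)

final_answer = majority_vote(["Paris", " paris ", "London"])
```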
+
+
+
+
+
+
+
+ ☆ M-Ped: Multi-Prompt Ensemble Decoding for Large Language Models
+
+
+ With the widespread application of Large Language Models (LLMs) in the field
+of Natural Language Processing (NLP), enhancing their performance has become a
+research hotspot. This paper presents a novel multi-prompt ensemble decoding
+approach designed to bolster the generation quality of LLMs by leveraging the
+aggregation of outcomes from multiple prompts. Given a unique input $X$, we
+submit $n$ variations of prompts with $X$ to LLMs in batch mode to decode and
+derive probability distributions. For each token prediction, we calculate the
+ensemble probability by averaging the $n$ probability distributions within the
+batch, utilizing this aggregated probability to generate the token. This
+technique is dubbed Inner-Batch Ensemble. To facilitate efficient batch
+inference, we implement a Left-Padding strategy to maintain uniform input
+lengths across the $n$ prompts. Through extensive experimentation on diverse NLP
+tasks, including machine translation, code generation, and text simplification,
+we demonstrate the efficacy of our method in enhancing LLM performance. The
+results show substantial improvements in BLEU scores, pass@$k$ rates, and LENS
+metrics over conventional methods.
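A single decoding step of the Inner-Batch Ensemble described above can be sketched as follows. This is a minimal illustration rather than the paper's code: it assumes the model has already produced one row of next-token logits per prompt variant, uses greedy selection on the averaged distribution, and the helper names `left_pad`, `softmax`, and `ensemble_next_token` are hypothetical.

```python
import numpy as np

def left_pad(token_ids, pad_id=0):
    """Left-pad variable-length prompts so all n prompts share one length."""
    width = max(len(seq) for seq in token_ids)
    return np.array([[pad_id] * (width - len(seq)) + list(seq) for seq in token_ids])

def softmax(logits):
    """Row-wise softmax with max-subtraction for numerical stability."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def ensemble_next_token(batch_logits):
    """Average the n per-prompt distributions and pick the most likely token."""
    probs = softmax(np.asarray(batch_logits, dtype=float))  # shape (n, vocab)
    mean_probs = probs.mean(axis=0)                          # shape (vocab,)
    return int(mean_probs.argmax()), mean_probs

# Three prompt variants of the same input, over a toy vocabulary of size 3.
padded = left_pad([[5, 6, 7], [8]], pad_id=0)
token, mean_probs = ensemble_next_token([[2.0, 0.0, 0.0],
                                         [0.0, 0.0, 2.0],
                                         [0.0, 0.0, 1.0]])
```

In a full decoder, the chosen token would be appended to every prompt in the batch before the next step, so all $n$ variants continue from the same output prefix.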
+
+
+
+
+
+
+
+ ☆ DeepCRCEval: Revisiting the Evaluation of Code Review Comment Generation
+
+
+
+
+
+
+
+
+ Junyi Lu, Xiaojia Li, Zihan Hua, Lei Yu, Shiqi Cheng, Li Yang, Fengjun Zhang, Chun Zuo
+
+
+ Code review is a vital but demanding aspect of software development,
+generating significant interest in automating review comments. Traditional
+evaluation methods for these comments, primarily based on text similarity, face
+two major challenges: inconsistent reliability of human-authored comments in
+open-source projects and the weak correlation of text similarity with
+objectives like enhancing code quality and detecting defects.
+ This study empirically analyzes benchmark comments using a novel set of
+criteria informed by prior research and developer interviews. We then similarly
+revisit the evaluation of existing methodologies. Our evaluation framework,
+DeepCRCEval, integrates human evaluators and Large Language Models (LLMs) for a
+comprehensive reassessment of current techniques based on the criteria set.
+Besides, we also introduce an innovative and efficient baseline, LLM-Reviewer,
+leveraging the few-shot learning capabilities of LLMs for a target-oriented
+comparison.
+ Our research highlights the limitations of text similarity metrics, finding
+that less than 10% of benchmark comments are high quality for automation. In
+contrast, DeepCRCEval effectively distinguishes between high and low-quality
+comments, proving to be a more reliable evaluation mechanism. Incorporating LLM
+evaluators into DeepCRCEval significantly boosts efficiency, reducing time and
+cost by 88.78% and 90.32%, respectively. Furthermore, LLM-Reviewer demonstrates
+significant potential of focusing on the task's real targets in comment generation.
+
+
+
+ comment: Accepted to the 28th International Conference on Fundamental
+ Approaches to Software Engineering (FASE 2025), part of the 28th European
+ Joint Conferences on Theory and Practice of Software (ETAPS 2025)
+
+
+
+
+
+
+ ☆ GenAI Content Detection Task 2: AI vs. Human -- Academic Essay
+ Authenticity Challenge
+
+
+
+
+
+
+
+
+ Shammur Absar Chowdhury, Hind Almerekhi, Mucahid Kutlu, Kaan Efe Keles, Fatema Ahmad, Tasnim Mohiuddin, George Mikros, Firoj Alam
+
+
+ This paper presents a comprehensive overview of the first edition of the
+Academic Essay Authenticity Challenge, organized as part of the GenAI Content
+Detection shared tasks collocated with COLING 2025. This challenge focuses on
+detecting machine-generated vs. human-authored essays for academic purposes.
+The task is defined as follows: "Given an essay, identify whether it is
+generated by a machine or authored by a human." The challenge involves two
+languages: English and Arabic. During the evaluation phase, 25 teams submitted
+systems for English and 21 teams for Arabic, reflecting substantial interest in
+the task. Finally, seven teams submitted system description papers. The
+majority of submissions utilized fine-tuned transformer-based models, with one
+team employing Large Language Models (LLMs) such as Llama 2 and Llama 3. This
+paper outlines the task formulation, details the dataset construction process,
+and explains the evaluation framework. Additionally, we present a summary of
+the approaches adopted by participating teams. Nearly all submitted systems
+outperformed the n-gram-based baseline, with the top-performing systems
+achieving F1 scores exceeding 0.98 for both languages, indicating significant
+progress in the detection of machine-generated text.
+
+
+
+ comment: AI Generated Content, Academic Essay, LLMs, Arabic, English
+
+
+
+
+
+
+ ☆ Investigating Large Language Models for Code Vulnerability Detection: An
+ Experimental Study
+
+
+ Code vulnerability detection (CVD) is essential for addressing and preventing
+system security issues, playing a crucial role in ensuring software security.
+Previous learning-based vulnerability detection methods rely on either
+fine-tuning medium-size sequence models or training smaller neural networks
+from scratch. Recent advancements in large pre-trained language models (LLMs)
+have showcased remarkable capabilities in various code intelligence tasks
+including code understanding and generation. However, the effectiveness of LLMs
+in detecting code vulnerabilities is largely under-explored. This work aims to
+investigate the gap by fine-tuning LLMs for the CVD task, involving four
+widely-used open-source LLMs. We also implement five other previous graph-based
+or medium-size sequence models for comparison. Experiments are conducted on
+five commonly-used CVD datasets, covering both short samples and
+long samples. In addition, we conduct quantitative experiments to investigate
+the class imbalance issue and the model's performance on samples of different
+lengths, which are rarely studied in previous works. To better support the
+community, we open-source all code and resources of this study at
+https://github.com/SakiRinn/LLM4CVD and
+https://huggingface.co/datasets/xuefen/VulResource.
+
+
+
+ comment: Under Review
+
+
+
+
+
+
+ ☆ ICM-Assistant: Instruction-tuning Multimodal Large Language Models for
+ Rule-based Explainable Image Content Moderation AAAI 2025
+
+
+ Controversial contents largely inundate the Internet, infringing various
+cultural norms and child protection standards. Traditional Image Content
+Moderation (ICM) models fall short in producing precise moderation decisions
+for diverse standards, while recent multimodal large language models (MLLMs),
+when adopted to general rule-based ICM, often produce classification and
+explanation results that are inconsistent with human moderators. Aiming at
+flexible, explainable, and accurate ICM, we design a novel rule-based dataset
+generation pipeline, decomposing concise human-defined rules and leveraging
+well-designed multi-stage prompts to enrich short explicit image annotations.
+Our ICM-Instruct dataset includes detailed moderation explanation and
+moderation Q-A pairs. Built upon it, we create our ICM-Assistant model in the
+framework of rule-based ICM, making it readily applicable in real practice. Our
+ICM-Assistant model demonstrates exceptional performance and flexibility.
+Specifically, it significantly outperforms existing approaches on various
+sources, improving both the moderation classification (36.8% on average) and
+moderation explanation quality (26.6% on average) consistently over existing
+MLLMs. Code/Data is available at https://github.com/zhaoyuzhi/ICM-Assistant.
+
+
+
+
+
+
+
+
+ Zeru Shi, Zhenting Wang, Yongye Su, Weidi Luo, Fan Yang, Yongfeng Zhang
+
+
+ The performance of Large Language Models (LLMs) is based on the quality of
+the prompts and the semantic and structural integrity information of the input
+data. However, current prompt generation methods primarily focus on generating
+prompts for clean input data, often overlooking the impact of perturbed inputs
+on prompt performance. To address this limitation, we propose BATprompt (By
+Adversarial Training prompt), a novel method for prompt generation designed to
+withstand input perturbations (such as typos in the input). Inspired by
+adversarial training techniques, BATprompt demonstrates strong performance on a
+variety of perturbed tasks through a two-step process: adversarial perturbation
+and iterative optimization on unperturbed input via LLM. Unlike conventional
+adversarial attack methods, BATprompt avoids reliance on real gradients or
+model parameters. Instead, it leverages the advanced reasoning, language
+understanding and self reflection capabilities of LLMs to simulate gradients,
+guiding the generation of adversarial perturbations and optimizing prompt
+performance. In our experiments, we evaluate BATprompt on multiple datasets
+across both language understanding and generation tasks. The results indicate
+that BATprompt outperforms existing prompt generation methods, delivering
+superior robustness and performance under diverse perturbation scenarios.
+
+
+
+
+
+
+
+ ☆ VLABench: A Large-Scale Benchmark for Language-Conditioned Robotics
+ Manipulation with Long-Horizon Reasoning Tasks
+
+
+  General-purpose embodied agents are designed to understand the users'
+natural instructions or intentions and act precisely to complete universal
+tasks. Recently, methods based on foundation models especially
+Vision-Language-Action models (VLAs) have shown a substantial potential to
+solve language-conditioned manipulation (LCM) tasks well. However, existing
+benchmarks do not adequately meet the needs of VLAs and related algorithms. To
+better define such general-purpose tasks in the context of LLMs and advance the
+research in VLAs, we present VLABench, an open-source benchmark for evaluating
+universal LCM task learning. VLABench provides 100 carefully designed
+categories of tasks, with strong randomization in each category of task and a
+total of 2000+ objects. VLABench stands out from previous benchmarks in four
+key aspects: 1) tasks requiring world knowledge and common sense transfer, 2)
+natural language instructions with implicit human intentions rather than
+templates, 3) long-horizon tasks demanding multi-step reasoning, and 4)
+evaluation of both action policies and language model capabilities. The
+benchmark assesses multiple competencies including understanding of
+mesh & texture, spatial relationship, semantic instruction, physical laws,
+knowledge transfer and reasoning, etc. To support the downstream finetuning, we
+provide high-quality training data collected via an automated framework
+incorporating heuristic skills and prior information. The experimental results
+indicate that both the current state-of-the-art pretrained VLAs and the
+workflow based on VLMs face challenges in our tasks.
+
+
+
+
+
+
+
+ ☆ An Analysis on Automated Metrics for Evaluating Japanese-English Chat
+ Translation
+
+
+ This paper analyses how traditional baseline metrics, such as BLEU and TER,
+and neural-based methods, such as BERTScore and COMET, score several NMT models'
+performance on chat translation and how these metrics perform when compared to
+human-annotated scores. The results show that for ranking NMT models in chat
+translations, all metrics seem consistent in deciding which model outperforms
+the others. This implies that traditional baseline metrics, which are faster
+and simpler to use, can still be helpful. On the other hand, when it comes to
+better correlation with human judgment, neural-based metrics outperform
+traditional metrics, with COMET achieving the highest correlation with the
+human-annotated score on a chat translation. However, we show that even the
+best metric struggles when scoring English translations from sentences with
+anaphoric zero pronouns in Japanese.
+
+
+
+ comment: Accepted at the 29th Annual Meeting of the Association for Natural
+ Language Processing (NLP2023). Published version available at
+ https://www.anlp.jp/proceedings/annual_meeting/2023/pdf_dir/A8-1.pdf
+
+
+
+
+
+
+ ☆ On the Applicability of Zero-Shot Cross-Lingual Transfer Learning for
+ Sentiment Classification in Distant Language Pairs
+
+
+ This research explores the applicability of cross-lingual transfer learning
+from English to Japanese and Indonesian using the XLM-R pre-trained model. The
+results are compared with several previous works, either by models using a
+similar zero-shot approach or a fully-supervised approach, to provide an
+overview of the zero-shot transfer learning approach's capability using XLM-R
+in comparison with existing models. Our models achieve the best result in one
+Japanese dataset and comparable results in other datasets in Japanese and
+Indonesian languages without being trained using the target language.
+Furthermore, the results suggest that it is possible to train a multi-lingual
+model, instead of one model for each language, and achieve promising results.
+
+
+
+ comment: Accepted at the 28th Annual Meeting of the Association for Natural
+ Language Processing (NLP2022). Published version available at
+ https://www.anlp.jp/proceedings/annual_meeting/2022/pdf_dir/A6-1.pdf
+
+
+
+
+
+
+ ☆ Survey of Pseudonymization, Abstractive Summarization & Spell Checker
+ for Hindi and Marathi
+
+
+ India's vast linguistic diversity presents unique challenges and
+opportunities for technological advancement, especially in the realm of Natural
+Language Processing (NLP). While there has been significant progress in NLP
+applications for widely spoken languages, the regional languages of India, such
+as Marathi and Hindi, remain underserved. Research in the field of NLP for
+Indian regional languages is at a formative stage and holds immense
+significance. The paper aims to build a platform which enables the user to use
+various features like text anonymization, abstractive text summarization and
+spell checking in the English, Hindi and Marathi languages. The aim of these tools
+is to serve enterprise and consumer clients who predominantly use Indian
+Regional Languages.
+
+
+
+
+
+
+
+ ☆ scReader: Prompting Large Language Models to Interpret scRNA-seq Data ICDM 2024
+
+
+ Large language models (LLMs) have demonstrated remarkable advancements,
+primarily due to their capabilities in modeling the hidden relationships within
+text sequences. This innovation presents a unique opportunity in the field of
+life sciences, where vast collections of single-cell omics data from multiple
+species provide a foundation for training foundational models. However, the
+challenge lies in the disparity of data scales across different species,
+hindering the development of a comprehensive model for interpreting genetic
+data across diverse organisms. In this study, we propose an innovative hybrid
+approach that integrates the general knowledge capabilities of LLMs with
+domain-specific representation models for single-cell omics data
+interpretation. We begin by focusing on genes as the fundamental unit of
+representation. Gene representations are initialized using functional
+descriptions, leveraging the strengths of mature language models such as
+LLaMA-2. By inputting single-cell gene-level expression data with prompts, we
+effectively model cellular representations based on the differential expression
+levels of genes across various species and cell types. In the experiments, we
+constructed developmental cells from humans and mice, specifically targeting
+cells that are challenging to annotate. We evaluated our methodology through
+basic tasks such as cell annotation and visualization analysis. The results
+demonstrate the efficacy of our approach compared to other methods using LLMs,
+highlighting significant improvements in accuracy and interoperability. Our
+hybrid approach enhances the representation of single-cell data and offers a
+robust framework for future research in cross-species genetic analysis.
+
+
+
+ comment: 8 pages, Accepted by ICDM 2024
+
+
+
+
+
+
+ ☆ GeneSUM: Large Language Model-based Gene Summary Extraction
+
+
+ Emerging topics in biomedical research are continuously expanding, providing
+a wealth of information about genes and their function. This rapid
+proliferation of knowledge presents unprecedented opportunities for scientific
+discovery and formidable challenges for researchers striving to keep abreast of
+the latest advancements. One significant challenge is navigating the vast
+corpus of literature to extract vital gene-related information, a
+time-consuming and cumbersome task. To enhance the efficiency of this process,
+it is crucial to address several key challenges: (1) the overwhelming volume of
+literature, (2) the complexity of gene functions, and (3) the automated
+integration and generation. In response, we propose GeneSUM, a two-stage
+automated gene summary extractor utilizing a large language model (LLM). Our
+approach retrieves and eliminates redundancy of target gene literature and then
+fine-tunes the LLM to refine and streamline the summarization process. We
+conducted extensive experiments to validate the efficacy of our proposed
+framework. The results demonstrate that the LLM significantly enhances the
+integration of gene-specific information, allowing more efficient
+decision-making in ongoing research.
+
+
+
+ comment: 7 pages, Accepted by BIBM 2024
+
+
+
+
+
+
+ ☆ CoAM: Corpus of All-Type Multiword Expressions
+
+
+ Multiword expressions (MWEs) refer to idiomatic sequences of multiple words.
+MWE identification, i.e., detecting MWEs in text, can play a key role in
+downstream tasks such as machine translation. Existing datasets for MWE
+identification are inconsistently annotated, limited to a single type of MWE,
+or limited in size. To enable reliable and comprehensive evaluation, we created
+CoAM: Corpus of All-Type Multiword Expressions, a dataset of 1.3K sentences
+constructed through a multi-step process consisting of human annotation, human
+review, and automated consistency checking to enhance data quality. MWEs in
+CoAM are tagged with MWE types, such as Noun and Verb, to enable fine-grained
+error analysis. Annotations for CoAM were collected using a new interface
+created with our interface generator, which allows easy and flexible annotation
+of MWEs in any form, including discontinuous ones. Through experiments using
+CoAM, we find that a fine-tuned large language model outperforms the current
+state-of-the-art approach for MWE identification. Furthermore, analysis using
+our MWE type tagged data reveals that Verb MWEs are easier than Noun MWEs to
+identify across approaches.
+
+
+
+
+
+
+
+ ☆ Are We in the AI-Generated Text World Already? Quantifying and
+ Monitoring AIGT on Social Media
+
+
+
+
+
+
+
+
+ Zhen Sun, Zongmin Zhang, Xinyue Shen, Ziyi Zhang, Yule Liu, Michael Backes, Yang Zhang, Xinlei He
+
+
+ Social media platforms are experiencing a growing presence of AI-Generated
+Texts (AIGTs). However, the misuse of AIGTs could have profound implications
+for public opinion, such as spreading misinformation and manipulating
+narratives. Despite its importance, a systematic study to assess the prevalence
+of AIGTs on social media is still lacking. To address this gap, this paper aims
+to quantify, monitor, and analyze the AIGTs on online social media platforms.
+We first collect a dataset (SM-D) with around 2.4M posts from 3 major social
+media platforms: Medium, Quora, and Reddit. Then, we construct a diverse
+dataset (AIGTBench) to train and evaluate AIGT detectors. AIGTBench combines
+popular open-source datasets and our AIGT datasets generated from social media
+texts by 12 LLMs, serving as a benchmark for evaluating mainstream detectors.
+With this setup, we identify the best-performing detector (OSM-Det). We then
+apply OSM-Det to SM-D to track AIGTs over time and observe different trends of
+AI Attribution Rate (AAR) across social media platforms from January 2022 to
+October 2024. Specifically, Medium and Quora exhibit marked increases in AAR,
+rising from 1.77% to 37.03% and 2.06% to 38.95%, respectively. In contrast,
+Reddit shows slower growth, with AAR increasing from 1.31% to 2.45% over the
+same period. Our further analysis indicates that AIGTs differ from
+human-written texts across several dimensions, including linguistic patterns,
+topic distributions, engagement levels, and the follower distribution of
+authors. We envision that our analysis and findings on AIGTs in social media can
+shed light on future research in this domain.
+
+
+ The in-image machine translation task involves translating text embedded
+within images, with the translated results presented in image format. While
+this task has numerous applications in various scenarios such as film poster
+translation and everyday scene image translation, existing methods frequently
+neglect the aspect of consistency throughout this process. We propose the need
+to uphold two types of consistency in this task: translation consistency and
+image generation consistency. The former entails incorporating image
+information during translation, while the latter involves maintaining
+consistency between the style of the text-image and the original image,
+ensuring background integrity. To address these consistency requirements, we
+introduce a novel two-stage framework named HCIIT (High-Consistency In-Image
+Translation) which involves text-image translation using a multimodal
+multilingual large language model in the first stage and image backfilling with
+a diffusion model in the second stage. Chain of thought learning is utilized in
+the first stage to enhance the model's ability to leverage image information
+during translation. Subsequently, a diffusion model trained for
+style-consistent text-image generation ensures uniformity in text style within
+images and preserves background details. A dataset comprising 400,000
+style-consistent pseudo text-image pairs is curated for model training. Results
+obtained on both curated test sets and authentic image test sets validate the
+effectiveness of our framework in ensuring consistency and producing
+high-quality translated images.
+
+
+
+
+
+
+
+ ☆ LSAQ: Layer-Specific Adaptive Quantization for Large Language Model
+ Deployment
+
+
+
+
+
+
+
+
+ Binrui Zeng, Bin Ji, Xiaodong Liu, Jie Yu, Shasha Li, Jun Ma, Xiaopeng Li, Shangwen Wang, Xinran Hong
+
+
+ As large language models (LLMs) demonstrate exceptional performance across
+various domains, the deployment of these models on edge devices has emerged as
+a new trend. Quantization techniques, which reduce the size and memory
+footprint of LLMs, are effective for enabling deployment on
+resource-constrained edge devices. However, existing one-size-fits-all
+quantization methods often fail to dynamically adjust the memory consumption of
+LLMs based on specific hardware characteristics and usage scenarios. To address
+this limitation, we propose LSAQ (Layer-Specific Adaptive Quantization), a
+system for adaptive quantization and dynamic deployment of LLMs based on layer
+importance. LSAQ evaluates layer importance by constructing top-k token sets
+from the inputs and outputs of each layer and calculating their Jaccard
+coefficient. Using this evaluation, the system adaptively adjusts quantization
+strategies in real time according to the resource availability of edge devices,
+assigning different precision levels to layers of varying importance. This
+approach significantly reduces the storage requirements of LLMs while
+maintaining model performance, enabling efficient deployment across diverse
+hardware platforms and usage scenarios.
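
The layer-importance measure described in the LSAQ abstract above can be illustrated with a minimal sketch. It assumes each layer's input and output have already been turned into one score per vocabulary token (how that is done is not stated in the abstract), and it assumes that a lower Jaccard coefficient between the two top-k sets means a more important layer. The function names are hypothetical and are not taken from the paper.

```python
def top_k_tokens(scores, k):
    """Return the set of the k token ids with the highest scores."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(ranked[:k])


def jaccard(a, b):
    """Jaccard coefficient |a & b| / |a | b| of two sets (1.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def layer_importance(input_scores, output_scores, k):
    """Assumed importance: 1 - Jaccard similarity of the input/output top-k sets.

    A layer that barely changes the top-k tokens gets a score near 0;
    a layer that replaces them entirely gets a score of 1.
    """
    similarity = jaccard(top_k_tokens(input_scores, k),
                         top_k_tokens(output_scores, k))
    return 1.0 - similarity


if __name__ == "__main__":
    # Toy per-token scores over a 4-token vocabulary (made-up numbers).
    layer_in = [0.1, 0.9, 0.5, 0.3]
    layer_out = [0.8, 0.7, 0.2, 0.1]
    print(layer_importance(layer_in, layer_out, k=2))
```

Under these assumptions, layers with higher scores would be the ones kept at higher precision when memory is tight.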
+
+
+
+ comment: 8 pages, 4 figures, work in progress
+
+
+
+
+
+
+ ☆ AEIOU: A Unified Defense Framework against NSFW Prompts in Text-to-Image
+ Models
+
+
+ As text-to-image (T2I) models continue to advance and gain widespread
+adoption, their associated safety issues are becoming increasingly prominent.
+Malicious users often exploit these models to generate Not-Safe-for-Work (NSFW)
+images using harmful or adversarial prompts, highlighting the critical need for
+robust safeguards to ensure the integrity and compliance of model outputs.
+Current internal safeguards frequently degrade image quality, while external
+detection methods often suffer from low accuracy and inefficiency.
+ In this paper, we introduce AEIOU, a defense framework that is Adaptable,
+Efficient, Interpretable, Optimizable, and Unified against NSFW prompts in T2I
+models. AEIOU extracts NSFW features from the hidden states of the model's text
+encoder, utilizing the separable nature of these features to detect NSFW
+prompts. The detection process is efficient, requiring minimal inference time.
+AEIOU also offers real-time interpretation of results and supports optimization
+through data augmentation techniques. The framework is versatile, accommodating
+various T2I architectures. Our extensive experiments show that AEIOU
+significantly outperforms both commercial and open-source moderation tools,
+achieving over 95% accuracy across all datasets and improving efficiency by at
+least tenfold. It effectively counters adaptive attacks and excels in few-shot
+and multi-label scenarios.
+
+
+
+
+
+
+
+ ☆ Do Language Models Understand the Cognitive Tasks Given to Them?
+ Investigations with the N-Back Paradigm
+
+
+ Cognitive tasks originally developed for humans are now increasingly used to
+study language models. While applying these tasks is often straightforward,
+interpreting their results can be challenging. In particular, when a model
+underperforms, it's often unclear whether this results from a limitation in the
+cognitive ability being tested or a failure to understand the task itself. A
+recent study argued that GPT 3.5's declining performance on 2-back and 3-back
+tasks reflects a working memory capacity limit similar to humans. By analyzing
+a range of open-source language models of varying performance levels on these
+tasks, we show that the poor performance instead reflects a limitation in task
+comprehension and task set maintenance. In addition, we push the best
+performing model to higher n values and experiment with alternative prompting
+strategies, before analyzing model attentions. Our larger aim is to contribute
+to the ongoing conversation around refining methodologies for the cognitive
+evaluation of language models.
+
+
+
+
+
+
+
+ ☆ Molly: Making Large Language Model Agents Solve Python Problem More
+ Logically
+
+
+
+
+
+
+
+
+ Rui Xiao, Jiong Wang, Lu Han, Na Zong, Han Wu
+
+
+ Applying large language models (LLMs) as teaching assistants has attracted much
+attention as an integral part of intelligent education, particularly in
+computing courses. To reduce the gap between LLMs and computer
+programming education experts, fine-tuning and retrieval-augmented generation
+(RAG) are the two mainstream methods in existing research. However,
+fine-tuning for specific tasks is resource-intensive and may diminish the
+model's generalization capabilities. RAG performs well at reducing the
+hallucinations of LLMs, but the generation of irrelevant factual content during
+reasoning can cause significant confusion for learners. To address these
+problems, we introduce the Molly agent, which focuses on solving the
+problems encountered by learners when learning the Python programming language. Our
+agent automatically parses the learners' questioning intent through a
+scenario-based interaction, enabling precise retrieval of relevant documents
+from the constructed knowledge base. At the generation stage, the agent reflects on
+the generated responses to ensure that they not only align with factual content
+but also effectively answer the user's queries. Extensive experimentation on a
+constructed Chinese Python QA dataset shows the effectiveness of the Molly
+agent, indicating an enhancement in its performance for providing useful
+responses to Python questions.
+
+
+
+ comment: arXiv admin note: text overlap with arXiv:2402.07913
+
+ Motion planning is a crucial component in autonomous driving.
+State-of-the-art motion planners are trained on meticulously curated datasets,
+which are not only expensive to annotate but also insufficient in capturing
+rarely seen critical scenarios. Failing to account for such scenarios poses a
+significant risk to motion planners and may lead to incidents during testing.
+An intuitive solution is to manually compose such scenarios by programming and
+executing a simulator (e.g., CARLA). However, this approach incurs substantial
+human costs. Motivated by this, we propose an inexpensive method for generating
+diverse critical traffic scenarios to train more robust motion planners. First,
+we represent traffic scenarios as scripts, which are then used by the simulator
+to generate traffic scenarios. Next, we develop a method that accepts
+user-specified text descriptions, which a Large Language Model (LLM) translates
+into scripts using in-context learning. The output scripts are sent to the
+simulator that produces the corresponding traffic scenarios. As our method can
+generate abundant safety-critical traffic scenarios, we use them as synthetic
+training data for motion planners. To demonstrate the value of generated
+scenarios, we train existing motion planners on our synthetic data, real-world
+datasets, and a combination of both. Our experiments show that motion planners
+trained with our data significantly outperform those trained solely on
+real-world data, showing the usefulness of our synthetic data and the
+effectiveness of our data generation method. Our source code is available at
+https://ezharjan.github.io/AutoSceneGen.
+
+
+
+
+
+
+
+ ☆ MMFactory: A Universal Solution Search Engine for Vision-Language Tasks
+
+
+ With advances in foundational and vision-language models, and effective
+fine-tuning techniques, a large number of both general and special-purpose
+models have been developed for a variety of visual tasks. Despite the
+flexibility and accessibility of these models, no single model is able to
+handle all tasks and/or applications that may be envisioned by potential users.
+Recent approaches, such as visual programming and multimodal LLMs with
+integrated tools aim to tackle complex visual tasks, by way of program
+synthesis. However, such approaches overlook user constraints (e.g.,
+performance / computational needs), produce test-time sample-specific solutions
+that are difficult to deploy, and, sometimes, require low-level instructions
+that may be beyond the abilities of a naive user. To address these limitations,
+we introduce MMFactory, a universal framework that includes model and metrics
+routing components, acting like a solution search engine across various
+available models. Based on a task description and a few sample input-output pairs
+and (optionally) resource and/or performance constraints, MMFactory can suggest
+a diverse pool of programmatic solutions by instantiating and combining
+visio-lingual tools from its model repository. In addition to synthesizing
+these solutions, MMFactory also proposes metrics and benchmarks performance /
+resource characteristics, allowing users to pick a solution that meets their
+unique design constraints. From the technical perspective, we also introduce a
+committee-based solution proposer that leverages multi-agent LLM conversation
+to generate executable, diverse, universal, and robust solutions for the user.
+Experimental results show that MMFactory outperforms existing methods by
+delivering state-of-the-art solutions tailored to user problem specifications.
+Project page is available at https://davidhalladay.github.io/mmfactory_demo.
+
+
+
+
+
+
+
+ ☆ Improving Factuality with Explicit Working Memory
+
+
+ Large language models can generate factually inaccurate content, a problem
+known as hallucination. Recent works have built upon retrieval-augmented
+generation to improve factuality through iterative prompting, but these methods
+are limited by the traditional RAG design. To address these challenges, we
+introduce EWE (Explicit Working Memory), a novel approach that enhances
+factuality in long-form text generation by integrating a working memory that
+receives real-time feedback from external resources. The memory is refreshed
+based on online fact-checking and retrieval feedback, allowing EWE to rectify
+false claims during the generation process and ensure more accurate and
+reliable outputs. Our experiments demonstrate that EWE outperforms strong
+baselines on four fact-seeking long-form generation datasets, increasing the
+factuality metric, VeriScore, by 2 to 10 points absolute without sacrificing
+the helpfulness of the responses. Further analysis reveals that the design of
+rules for memory updates, configurations of memory units, and the quality of
+the retrieval datastore are crucial factors for influencing model performance.
+
+
+
+
+
+
+
+ ☆ Lla-VAP: LSTM Ensemble of Llama and VAP for Turn-Taking Prediction
+
+
+ Turn-taking prediction is the task of anticipating when the speaker in a
+conversation will yield their turn to another speaker to begin speaking. This
+project expands on existing strategies for turn-taking prediction by employing
+a multi-modal ensemble approach that integrates large language models (LLMs)
+and voice activity projection (VAP) models. By combining the linguistic
+capabilities of LLMs with the temporal precision of VAP models, we aim to
+improve the accuracy and efficiency of identifying transition relevance places (TRPs) in both scripted and
+unscripted conversational scenarios. Our methods are evaluated on the
+In-Conversation Corpus (ICC) and Coached Conversational Preference Elicitation
+(CCPE) datasets, highlighting the strengths and limitations of current models
+while proposing a potentially more robust framework for enhanced prediction.
+
+
+
+
+
+
+
+ ☆ Neuron Empirical Gradient: Connecting Neurons' Linear Controllability
+ and Representational Capacity
+
+
+ Although neurons in the feed-forward layers of pre-trained language models
+(PLMs) can store factual knowledge, most prior analyses remain qualitative,
+leaving the quantitative relationship among knowledge representation, neuron
+activations, and model output poorly understood. In this study, by performing
+neuron-wise interventions using factual probing datasets, we first reveal the
+linear relationship between neuron activations and output token probabilities.
+We refer to the gradient of this linear relationship as "neuron empirical
+gradients" and propose NeurGrad, an efficient method for their calculation to
+facilitate quantitative neuron analysis. We next investigate whether neuron
+empirical gradients in PLMs encode general task knowledge by probing skill
+neurons. To this end, we introduce MCEval8k, a multi-choice knowledge
+evaluation benchmark spanning six genres and 22 tasks. Our experiments confirm
+that neuron empirical gradients effectively capture knowledge, while skill
+neurons exhibit efficiency, generality, inclusivity, and interdependency. These
+findings link knowledge to PLM outputs via neuron empirical gradients, shedding
+light on how PLMs store knowledge. The code and dataset are released.
+
+
+
+ comment: 29 pages, 18 figures
+
+
+
+
+
+
+ ♻ ☆ Tokens, the oft-overlooked appetizer: Large language models, the
+ distributional hypothesis, and meaning
+
+
+
+
+
+
+
+
+ Julia Witte Zimmerman, Denis Hudon, Kathryn Cramer, Alejandro J. Ruiz, Calla Beauregard, Ashley Fehr, Mikaela Irene Fudolig, Bradford Demarest, Yoshi Meke Bird, Milo Z. Trujillo, Christopher M. Danforth, Peter Sheridan Dodds
+
+
+ Tokenization is a necessary component within the current architecture of many
+language models, including the transformer-based large language models (LLMs)
+of Generative AI, yet its impact on the model's cognition is often overlooked.
+We argue that LLMs demonstrate that the Distributional Hypothesis (DH) is
+sufficient for reasonably human-like language performance, and that the
+emergence of human-meaningful linguistic units among tokens motivates
+linguistically-informed interventions in existing, linguistically-agnostic
+tokenization techniques, particularly with respect to their roles as (1)
+semantic primitives and as (2) vehicles for conveying salient distributional
+patterns from human language to the model. We explore tokenizations from a BPE
+tokenizer; extant model vocabularies obtained from Hugging Face and tiktoken;
+and the information in exemplar token vectors as they move through the layers
+of a RoBERTa (large) model. Besides creating sub-optimal semantic building
+blocks and obscuring the model's access to the necessary distributional
+patterns, we describe how tokenization pretraining can be a backdoor for bias
+and other unwanted content, which current alignment practices may not
+remediate. Additionally, we relay evidence that the tokenization algorithm's
+objective function impacts the LLM's cognition, despite being meaningfully
+insulated from the main system intelligence.
+
+
+
+
+
+
+
+ ♻ ☆ Can LLMs Obfuscate Code? A Systematic Analysis of Large Language Models
+ into Assembly Code Obfuscation AAAI 2025
+
+
+ Malware authors often employ code obfuscations to make their malware harder
+to detect. Existing tools for generating obfuscated code often require access
+to the original source code (e.g., C++ or Java), and adding new obfuscations is
+a non-trivial, labor-intensive process. In this study, we ask the following
+question: Can Large Language Models (LLMs) potentially generate new
+obfuscated assembly code? If so, this poses a risk to anti-virus engines and
+potentially increases the flexibility of attackers to create new obfuscation
+patterns. We answer this in the affirmative by developing the MetamorphASM
+benchmark, comprising the MetamorphASM Dataset (MAD) along with three code
+obfuscation techniques: dead code, register substitution, and control flow
+change. The MetamorphASM benchmark systematically evaluates the ability of LLMs to
+generate and analyze obfuscated code using MAD, which contains 328,200
+obfuscated assembly code samples. We release this dataset and analyze the
+success rate of various LLMs (e.g., GPT-3.5/4, GPT-4o-mini, Starcoder,
+CodeGemma, CodeLlama, CodeT5, and LLaMA 3.1) in generating obfuscated assembly
+code. The evaluation was performed using established information-theoretic
+metrics and manual human review to ensure correctness and provide the
+foundation for researchers to study and develop remediations to this risk. The
+source code can be found at the following GitHub link:
+https://github.com/mohammadi-ali/MetamorphASM.
+
+
+
+ comment: To appear in AAAI 2025, Main Track
+
+
+
+
+
+
+ ♻ ☆ Evaluating Zero-Shot Multilingual Aspect-Based Sentiment Analysis with
+ Large Language Models
+
+
+ Aspect-based sentiment analysis (ABSA), a sequence labeling task, has
+attracted increasing attention in multilingual contexts. While previous
+research has focused largely on fine-tuning or training models specifically for
+ABSA, we evaluate large language models (LLMs) under zero-shot conditions to
+explore their potential to tackle this challenge with minimal task-specific
+adaptation. We conduct a comprehensive empirical evaluation of a series of LLMs
+on multilingual ABSA tasks, investigating various prompting strategies,
+including vanilla zero-shot, chain-of-thought (CoT), self-improvement,
+self-debate, and self-consistency, across nine different models. Results
+indicate that while LLMs show promise in handling multilingual ABSA, they
+generally fall short of fine-tuned, task-specific models. Notably, simpler
+zero-shot prompts often outperform more complex strategies, especially in
+high-resource languages like English. These findings underscore the need for
+further refinement of LLM-based approaches to effectively address ABSA tasks
+across diverse languages.
+
+
+
+
+
+
+
+ ♻ ☆ SpikingSSMs: Learning Long Sequences with Sparse and Parallel Spiking
+ State Space Models
+
+
+ Known as low energy consumption networks, spiking neural networks (SNNs) have
+gained a lot of attention within the past decades. While SNNs are increasingly
+competitive with artificial neural networks (ANNs) for vision tasks, they are
+rarely used for long sequence tasks, despite their intrinsic temporal dynamics.
+In this work, we develop spiking state space models (SpikingSSMs) for long
+sequence learning by leveraging the sequence learning abilities of state
+space models (SSMs). Inspired by dendritic neuron structure, we hierarchically
+integrate neuronal dynamics with the original SSM block, meanwhile realizing
+sparse synaptic computation. Furthermore, to solve the conflict of event-driven
+neuronal dynamics with parallel computing, we propose a light-weight surrogate
+dynamic network which accurately predicts the after-reset membrane potential
+and is compatible with learnable thresholds, enabling orders-of-magnitude acceleration in
+training speed compared with conventional iterative methods. On the Long Range
+Arena benchmark tasks, SpikingSSM achieves performance competitive with
+state-of-the-art SSMs while realizing on average 90% network sparsity.
+On language modeling, our network significantly surpasses existing spiking
+large language models (spikingLLMs) on the WikiText-103 dataset with only a
+third of the model size, demonstrating its potential as a backbone architecture
+for low computation cost LLMs.
+
+
+
+
+
+
+
+ ♻ ☆ YuLan-Mini: An Open Data-efficient Language Model
+
+
+ Effective pre-training of large language models (LLMs) has been challenging
+due to the immense resource demands and the complexity of the technical
+processes involved. This paper presents a detailed technical report on
+YuLan-Mini, a highly capable base model with 2.42B parameters that achieves
+top-tier performance among models of similar parameter scale. Our pre-training
+approach focuses on enhancing training efficacy through three key technical
+contributions: an elaborate data pipeline that combines data cleaning with data
+schedule strategies, a robust optimization method to mitigate training
+instability, and an effective annealing approach that incorporates targeted
+data selection and long context training. Remarkably, YuLan-Mini, trained on
+1.08T tokens, achieves performance comparable to industry-leading models that
+require significantly more data. To facilitate reproduction, we release the
+full details of the data composition for each training phase. Project details
+can be accessed at the following link: https://github.com/RUC-GSAI/YuLan-Mini.
+
+
+
+
+
+
+
+
+ Shaina Raza, Mizanur Rahman, Michael R. Zhang
+
+
+ Recent advancements in large language models (LLMs) have greatly enhanced
+natural language processing (NLP) applications. Nevertheless, these models
+often inherit biases from their training data. Despite the availability of
+various datasets for bias detection, most are limited to one or two NLP tasks
+(typically classification or evaluation) and lack comprehensive evaluations
+across a broader range of NLP tasks. To address this gap, we introduce the Bias
+Evaluations Across Domains (BEADs) dataset, designed to support a wide array of
+NLP tasks, including text classification, token classification, bias
+quantification, and benign language generation. A key focus of this paper is
+the gold label dataset that is annotated by GPT-4 for scalability and verified by
+experts to ensure high reliability. BEADs provides data for both fine-tuning,
+including classification and language generation tasks, and for evaluating
+LLMs. Our findings indicate that models fine-tuned on BEADs effectively identify
+numerous biases. BEADs also reduces biases when used for fine-tuning on
+language generation tasks, while preserving language quality. The
+results also reveal some prevalent demographic biases in LLMs when BEADs is
+used for evaluation on demographic tasks. We provide the BEADs dataset for
+detecting biases in various domains, and this dataset is readily usable for
+responsible AI development and application. The dataset can be accessed at
+https://huggingface.co/datasets/shainar/BEAD .
+
+
+
+
+
+
+
+
+ Wen Cheng, Ke Sun, Xinyu Zhang, Wei Wang
+
+
+ The rapid development of large language models (LLMs) has significantly
+advanced code completion capabilities, giving rise to a new generation of
+LLM-based Code Completion Tools (LCCTs). Unlike general-purpose LLMs, these
+tools possess unique workflows, integrating multiple information sources as
+input and prioritizing code suggestions over natural language interaction,
+which introduces distinct security challenges. Additionally, LCCTs often rely
+on proprietary code datasets for training, raising concerns about the potential
+exposure of sensitive data. This paper exploits these distinct characteristics
+of LCCTs to develop targeted attack methodologies on two critical security
+risks: jailbreaking and training data extraction attacks. Our experimental
+results expose significant vulnerabilities within LCCTs, including a 99.4%
+success rate in jailbreaking attacks on GitHub Copilot and a 46.3% success rate
+on Amazon Q. Furthermore, we successfully extracted sensitive user data from
+GitHub Copilot, including 54 real email addresses and 314 physical addresses
+associated with GitHub usernames. Our study also demonstrates that these
+code-based attack methods are effective against general-purpose LLMs, such as
+the GPT series, highlighting a broader security misalignment in the handling of
+code by modern LLMs. These findings underscore critical security challenges
+associated with LCCTs and suggest essential directions for strengthening their
+security frameworks. The example code and attack samples from our research are
+provided at https://github.com/Sensente/Security-Attacks-on-LCCTs.
+
+
+ As the development of large language models (LLMs) rapidly advances, securing
+these models effectively without compromising their utility has become a
+pivotal area of research. However, current defense strategies against jailbreak
+attacks (i.e., efforts to bypass security protocols) often suffer from limited
+adaptability, restricted general capability, and high cost. To address these
+challenges, we introduce SafeAligner, a methodology implemented at the decoding
+stage to fortify defenses against jailbreak attacks. We begin by developing two
+specialized models: the Sentinel Model, which is trained to foster safety, and
+the Intruder Model, designed to generate riskier responses. SafeAligner
+leverages the disparity in security levels between the responses from these
+models to differentiate between harmful and beneficial tokens, effectively
+guiding the safety alignment by altering the output token distribution of the
+target model. Extensive experiments show that SafeAligner can increase the
+likelihood of beneficial tokens, while reducing the occurrence of harmful ones,
+thereby ensuring secure alignment with minimal loss to generality.
+
+
+
+
+
+
+
+
+ Daniel Nahmias, Gal Engelberg, Dan Klein, Asaf Shabtai
+
+
+ Spear-phishing attacks present a significant security challenge, with large
+language models (LLMs) escalating the threat by generating convincing emails
+and facilitating target reconnaissance. To address this, we propose a detection
+approach based on a novel document vectorization method that utilizes an
+ensemble of LLMs to create representation vectors. By prompting LLMs to reason
+and respond to human-crafted questions, we quantify the presence of common
+persuasion principles in the email's content, producing prompted contextual
+document vectors for a downstream supervised machine learning model. We
+evaluate our method using a unique dataset generated by a proprietary system
+that automates target reconnaissance and spear-phishing email creation. Our
+method achieves a 91% F1 score in identifying LLM-generated spear-phishing
+emails, with the training set comprising only traditional phishing and benign
+emails. Key contributions include a novel document vectorization method
+utilizing LLM reasoning, a publicly available dataset of high-quality
+spear-phishing emails, and the demonstrated effectiveness of our method in
+detecting such emails. This methodology can be utilized for various document
+classification tasks, particularly in adversarial problem domains.
+
+
+
+
+
+
+
+ ♻ ☆ A Review of the Marathi Natural Language Processing
+
+
+ Marathi is one of the most widely used languages in the world. One might
+expect that the latest advances in NLP research in languages like English reach
+such a large community. However, NLP advancements in English didn't immediately
+reach Indian languages like Marathi. There were several reasons for this. They
+included the diversity of scripts used, the lack of (publicly available) resources like
+tokenization strategies, high-quality datasets & benchmarks, and evaluation
+metrics. In addition, the morphologically rich nature of Marathi made
+NLP tasks challenging. Advances in Neural Network (NN) based models and tools
+since the early 2000s helped improve this situation and make NLP research more
+accessible. In the past 10 years, significant efforts were made to improve
+language resources for all 22 scheduled languages of India. This paper presents
+a broad overview of the evolution of NLP research in Indic languages with a focus
+on Marathi and the state-of-the-art resources and tools available to the research
+community. It also provides an overview of tools & techniques associated with
+Marathi NLP tasks.
+
+
+ Beyond-triple fact representations including hyper-relational facts with
+auxiliary key-value pairs, temporal facts with additional timestamps, and
+nested facts implying relationships between facts, are gaining significant
+attention. However, existing link prediction models are usually designed for
+one specific type of fact, making it difficult to generalize to other fact
+representations. To overcome this limitation, we propose a Unified Hierarchical
+Representation learning framework (UniHR) for unified knowledge graph link
+prediction. It consists of a unified Hierarchical Data Representation (HiDR)
+module and a unified Hierarchical Structure Learning (HiSL) module as graph
+encoder. The HiDR module unifies hyper-relational KGs, temporal KGs, and nested
+factual KGs into triple-based representations. Then HiSL incorporates
+intra-fact and inter-fact message passing, focusing on enhancing the semantic
+information within individual facts and enriching the structural information
+between facts. Experimental results across 7 datasets from 3 types of KGs
+demonstrate that our UniHR outperforms baselines designed for one specific kind
+of KG, indicating the strong generalization capability of the HiDR form and the
+effectiveness of the HiSL module. Code and data are available at
+https://github.com/Lza12a/UniHR.
+
+
+
+
+
+
+
+ ♻ ☆ TableRAG: Million-Token Table Understanding with Language Models NeurIPS 2024
+
+
+ Recent advancements in language models (LMs) have notably enhanced their
+ability to reason with tabular data, primarily through program-aided mechanisms
+that manipulate and analyze tables. However, these methods often require the
+entire table as input, leading to scalability challenges due to the positional
+bias or context length constraints. In response to these challenges, we
+introduce TableRAG, a Retrieval-Augmented Generation (RAG) framework
+specifically designed for LM-based table understanding. TableRAG leverages
+query expansion combined with schema and cell retrieval to pinpoint crucial
+information before providing it to the LMs. This enables more efficient data
+encoding and precise retrieval, significantly reducing prompt lengths and
+mitigating information loss. We have developed two new million-token benchmarks
+from the Arcade and BIRD-SQL datasets to thoroughly evaluate TableRAG's
+effectiveness at scale. Our results demonstrate that TableRAG's retrieval
+design achieves the highest retrieval quality, leading to new
+state-of-the-art performance on large-scale table understanding.
+
+
+ Transformer-based large language models (LLMs) use the key-value (KV) cache
+to significantly accelerate inference by storing the key and value embeddings
+of past tokens. However, this cache consumes significant GPU memory. In this
+work, we introduce HashEvict, an algorithm that uses locality-sensitive hashing
+(LSH) to compress the KV cache. HashEvict quickly locates tokens in the cache
+that are cosine dissimilar to the current query token. This is achieved by
+computing the Hamming distance between binarized Gaussian projections of the
+current token query and cached token keys, with a projection length much
+smaller than the embedding dimension. We maintain a lightweight binary
+structure in GPU memory to facilitate these calculations. Unlike existing
+compression strategies that compute attention to determine token retention,
+HashEvict makes these decisions pre-attention, thereby reducing computational
+costs. Additionally, HashEvict is dynamic - at every decoding step, the key and
+value of the current token replace the embeddings of a token expected to
+produce the lowest attention score. We demonstrate that HashEvict can compress
+the KV cache by 30%-70% while maintaining high performance across reasoning,
+multiple-choice, long-context retrieval and summarization tasks.
+
+
+ Solving complex mathematical problems via system-2 reasoning is a natural
+human skill, yet it remains a significant challenge for current large language
+models (LLMs). We identify the scarcity of deliberate multi-step reasoning data
+as a primary limiting factor. To this end, we introduce Enriched Instruction
+Tuning (EIT), a method that enriches existing human-annotated mathematical
+datasets by synergizing human and AI feedback to create fine-grained reasoning
+trajectories. These datasets are then used to fine-tune open-source LLMs,
+enhancing their mathematical reasoning abilities without reliance on any
+symbolic verification program. Concretely, EIT is composed of two critical
+steps: Enriching with Reasoning Plan (ERP) and Enriching with Reasoning Step
+(ERS). The former generates a high-level plan that breaks down complex
+instructions into a sequence of simpler objectives, while ERS fills in
+reasoning contexts often overlooked by human annotators, creating a smoother
+reasoning trajectory for LLM fine-tuning. Unlike existing CoT prompting methods
+that generate reasoning chains depending only on the LLM's internal knowledge, our
+method leverages human-annotated initial answers as "meta-knowledge" to help
+LLMs generate more detailed and precise reasoning processes, leading to a more
+trustworthy LLM expert for complex mathematical problems. In experiments, EIT
+achieves an accuracy of 84.1% on GSM8K and 32.5% on MATH, surpassing
+state-of-the-art fine-tuning and prompting methods, and even matching the
+performance of tool-augmented methods.
+
+
+
+
+
+
+
+ ♻ ☆ RAGONITE: Iterative Retrieval on Induced Databases and Verbalized RDF
+ for Conversational QA over KGs with RAG
+
+
+
+
+
+
+
+
+ Rishiraj Saha Roy, Chris Hinze, Joel Schlotthauer, Farzad Naderi, Viktor Hangya, Andreas Foltyn, Luzian Hahn, Fabian Kuech
+
+
+ Conversational question answering (ConvQA) is a convenient means of searching
+over RDF knowledge graphs (KGs), where a prevalent approach is to translate
+natural language questions to SPARQL queries. However, SPARQL has certain
+shortcomings: (i) it is brittle for complex intents and conversational
+questions, and (ii) it is not suitable for more abstract needs. Instead, we
+propose a novel two-pronged system where we fuse: (i) SQL-query results over a
+database automatically derived from the KG, and (ii) text-search results over
+verbalizations of KG facts. Our pipeline supports iterative retrieval: when the
+results of any branch are found to be unsatisfactory, the system can
+automatically opt for further rounds. We put everything together in a retrieval
+augmented generation (RAG) setup, where an LLM generates a coherent response
+from accumulated search results. We demonstrate the superiority of our proposed
+system over several baselines on a knowledge graph of BMW automobiles.
+
+
+
+ comment: Accepted at BTW 2025, 10 pages
+
+
+
+
+
+
+ ♻ ☆ Exploring Facets of Language Generation in the Limit
+
+
+ The recent work of Kleinberg & Mullainathan [KM24] provides a concrete model
+for language generation in the limit: given a sequence of examples from an
+unknown target language, the goal is to generate new examples from the target
+language such that no incorrect examples are generated beyond some point. In
+sharp contrast to strong negative results for the closely related problem of
+language identification, they establish positive results for language
+generation in the limit for all countable collections of languages. Follow-up
+work by Raman & Tewari [RT24] studies bounds on the number of distinct inputs
+required by an algorithm before correct language generation is achieved --
+namely, whether this is a constant for all languages in the collection (uniform
+generation) or a language-dependent constant (non-uniform generation).
+ We show that every countable language collection has a generator which has
+the stronger property of non-uniform generation in the limit. However, while
+the generation algorithm of [KM24] can be implemented using membership queries,
+we show that no algorithm can non-uniformly generate, even for collections
+of just two languages, using only membership queries.
+ We also formalize the tension between validity and breadth in the generation
+algorithm of [KM24] by introducing a definition of exhaustive generation, and
+show a strong negative result for exhaustive generation. Our result shows that
+a tradeoff between validity and breadth is inherent for generation in the
+limit. We also provide a precise characterization of the language collections
+for which exhaustive generation is possible. Finally, inspired by algorithms
+that can choose to obtain feedback, we consider a model of uniform generation
+with feedback, completely characterizing language collections for which such
+uniform generation with feedback is possible in terms of a complexity measure
+of the collection.
+
+
+
+ comment: 31 pages. Fixed typos, updated related work, added results on
+ characterization of exhaustive generation
+
+ Retrieval-augmented generation (RAG) synergizes the retrieval of pertinent
+data with the generative capabilities of Large Language Models (LLMs), ensuring
+that the generated output is not only contextually relevant but also accurate
+and current. We introduce XRAG, an open-source, modular codebase that
+facilitates exhaustive evaluation of the performance of foundational components
+of advanced RAG modules. These components are systematically categorized into
+four core phases: pre-retrieval, retrieval, post-retrieval, and generation. We
+systematically analyse them across reconfigured datasets, providing a
+comprehensive benchmark for their effectiveness. As the complexity of RAG
+systems continues to escalate, we underscore the critical need to identify
+potential failure points in RAG systems. We formulate a suite of experimental
+methodologies and diagnostic testing protocols to dissect the failure points
+inherent in RAG engineering. Subsequently, we proffer bespoke solutions aimed
+at bolstering the overall performance of these modules. Our work thoroughly
+evaluates the performance of advanced core components in RAG systems, providing
+insights into optimizations for prevalent failure points.
+
+
+
+
+
+
+
+ ♻ ☆ Listening to Patients: A Framework of Detecting and Mitigating Patient
+ Misreport for Medical Dialogue Generation
+
+
+
+
+
+
+
+
+ Lang Qin, Yao Zhang, Hongru Liang, Adam Jatowt, Zhenglu Yang
+
+
+ Medical Dialogue Systems aim to provide automated healthcare support through
+patient-agent conversations. Previous efforts typically regard patients as
+ideal users who accurately and consistently report their health
+conditions. However, in reality, patients often misreport their symptoms,
+leading to discrepancies between their reports and actual health conditions.
+Overlooking patient misreport will affect the quality of healthcare
+consultations provided by MDS. To address this issue, we argue that MDS should
+"listen to patients" and tackle two key challenges: how to detect and
+mitigate patient misreport effectively. In this work, we propose PaMis, a
+framework of detecting and mitigating Patient Misreport for medical dialogue
+generation. PaMis first constructs dialogue entity graphs, then detects patient
+misreport based on graph entropy, and mitigates patient misreport by
+formulating clarifying questions. Experiments indicate that PaMis effectively
+enhances medical response generation, enabling models like GPT-4 to detect and
+mitigate patient misreports, and provide high-quality healthcare assistance.
+
+
+
+
+
+
+
+ ♻ ☆ Re-examining learning linear functions in context
+
+
+ In-context learning (ICL) has emerged as a powerful paradigm for easily
+adapting Large Language Models (LLMs) to various tasks. However, our
+understanding of how ICL works remains limited. We explore a simple model of
+ICL in a controlled setup with synthetic training data to investigate ICL of
+univariate linear functions. We experiment with a range of GPT-2-like
+transformer models trained from scratch. Our findings challenge the prevailing
+narrative that transformers adopt algorithmic approaches like linear regression
+to learn a linear function in-context. These models fail to generalize beyond
+their training distribution, highlighting fundamental limitations in their
+capacity to infer abstract task structures. Our experiments lead us to propose
+a mathematically precise hypothesis of what the model might be learning.
+
+
+
+
+
+
+
+ ♻ ☆ GPTEval: A Survey on Assessments of ChatGPT and GPT-4
+
+
+
+
+
+
+
+
+ Rui Mao, Guanyi Chen, Xulang Zhang, Frank Guerin, Erik Cambria
+
+
+ The emergence of ChatGPT has generated much speculation in the press about
+its potential to disrupt social and economic systems. Its astonishing language
+ability has aroused strong curiosity among scholars about its performance in
+different domains. There have been many studies evaluating the ability of
+ChatGPT and GPT-4 in different tasks and disciplines. However, a comprehensive
+review summarizing the collective assessment findings is lacking. The objective
+of this survey is to thoroughly analyze prior assessments of ChatGPT and GPT-4,
+focusing on their language and reasoning abilities, scientific knowledge, and
+ethical considerations. Furthermore, an examination of the existing evaluation
+methods is conducted, offering several recommendations for future research in
+evaluating large language models.
+
+
+
+
+
+
+
+ ♻ ☆ Open-Vocabulary Mobile Manipulation Based on Double Relaxed Contrastive
+ Learning with Dense Labeling
+
+
+ Growing labor shortages are increasing the demand for domestic service robots
+(DSRs) to assist in various settings. In this study, we develop a DSR that
+transports everyday objects to specified pieces of furniture based on
+open-vocabulary instructions. Our approach focuses on retrieving images of
+target objects and receptacles from pre-collected images of indoor
+environments. For example, given an instruction "Please get the right red towel
+hanging on the metal towel rack and put it in the white washing machine on the
+left," the DSR is expected to carry the red towel to the washing machine based
+on the retrieved images. This is challenging because the correct images should
+be retrieved from thousands of collected images, which may include many images
+of similar towels and appliances. To address this, we propose RelaX-Former,
+which learns diverse and robust representations from among positive, unlabeled
+positive, and negative samples. We evaluated RelaX-Former on a dataset
+containing real-world indoor images and human-annotated instructions including
+complex referring expressions. The experimental results demonstrate that
+RelaX-Former outperformed existing baseline models across standard image
+retrieval metrics. Moreover, we performed physical experiments using a DSR to
+evaluate the performance of our approach in a zero-shot transfer setting. In the
+experiments, the DSR carried objects to specific receptacles based on
+open-vocabulary instructions, achieving an overall success rate of 75%.
+
+
+
+ comment: Accepted for IEEE RA-L 2025
+
+
+
+
+
+
+ ♻ ☆ On the loss of context-awareness in general instruction fine-tuning
+
+
+ Pre-trained Large Language Models (LLMs) require post-training methods such
+as supervised fine-tuning (SFT) on instruction-response pairs to enable
+instruction following. However, this process can potentially harm existing
+capabilities learned during pre-training. In this paper, we investigate the
+loss of context awareness after SFT, where context awareness is defined as the
+ability to extract and understand information from user-provided context and
+respond accordingly. We are the first to identify and show that the loss of
+context awareness, as reflected by the performance drop in the
+Needle-in-a-Haystack test, occurs in instruction fine-tuned LLMs when the chat
+template is applied to input prompts. We identify that the performance decline
+is partially caused by an attention bias toward different roles learned during
+conversational instruction fine-tuning. We validate our hypothesis by
+visualizing changes in attention allocation after the chat template is applied
+and manually steering the attention heads. Based on these observations, we
+propose a metric to select context-dependent examples from general instruction
+fine-tuning datasets. We then apply conditional instruction fine-tuning with a
+context-dependency indicator, enabling the model to learn context awareness
+from these selected examples. Empirical experiments on four context-dependent
+downstream tasks and three pre-trained LLMs of different sizes show that our
+method effectively mitigates the loss of context awareness without compromising
+general instruction-following capabilities. Given our findings, we strongly
+advocate for careful benchmarking of context awareness after instruction
+fine-tuning.
+
+
+
+
+
+
+
+ ♻ ☆ LLM-GAN: Construct Generative Adversarial Network Through Large Language
+ Models For Explainable Fake News Detection
+
+
+ Explainable fake news detection predicts the authenticity of news items with
+annotated explanations. Today, Large Language Models (LLMs) are known for their
+powerful natural language understanding and explanation generation abilities.
+However, applying LLMs to explainable fake news detection still faces two main
+challenges. Firstly, fake news appears reasonable and could easily mislead
+LLMs, leaving them unable to understand the complex news-faking process.
+Secondly, utilizing LLMs for this task would generate both correct and
+incorrect explanations, which necessitates abundant labor in the loop. In this
+paper, we propose LLM-GAN, a novel framework that utilizes prompting mechanisms
+to enable an LLM to act as both Generator and Detector for realistic fake news
+generation and detection. Our results demonstrate LLM-GAN's effectiveness in
+both prediction performance and explanation quality. We further showcase the
+integration of LLM-GAN into a cloud-native AI platform to provide a better fake
+news detection service in the cloud.
+
+
+
+
+
+
+
+ ♻ ☆ Improvement in Sign Language Translation Using Text CTC Alignment
+
+
+ Current sign language translation (SLT) approaches often rely on gloss-based
+supervision with Connectionist Temporal Classification (CTC), limiting their
+ability to handle non-monotonic alignments between sign language video and
+spoken text. In this work, we propose a novel method combining joint
+CTC/Attention and transfer learning. The joint CTC/Attention introduces
+hierarchical encoding and integrates CTC with the attention mechanism during
+decoding, effectively managing both monotonic and non-monotonic alignments.
+Meanwhile, transfer learning helps bridge the modality gap between vision and
+language in SLT. Experimental results on two widely adopted benchmarks,
+RWTH-PHOENIX-Weather 2014 T and CSL-Daily, show that our method achieves
+results comparable to the state of the art and outperforms the pure-attention
+baseline. Additionally, this work opens a new door for future research into
+gloss-free SLT using text-based CTC alignment.
+
+
+
+
+
+
+
+
+ Yibo Zhao, Jiapeng Zhu, Can Xu, Xiang Li
+
+
+ The rapid growth of social media platforms has raised significant concerns
+regarding online content toxicity. When Large Language Models (LLMs) are used
+for toxicity detection, two key challenges emerge: 1) the absence of
+domain-specific toxic knowledge leads to false negatives; 2) the excessive
+sensitivity of LLMs to toxic speech results in false positives, limiting
+freedom of speech. To address these issues, we propose a novel method called
+MetaTox, leveraging graph search on a meta-toxic knowledge graph to enhance
+hatred and toxicity detection. First, we construct a comprehensive meta-toxic
+knowledge graph by utilizing LLMs to extract toxic information through a
+three-step pipeline, with toxic benchmark datasets serving as corpora. Second,
+we query the graph via retrieval and ranking processes to supplement accurate,
+relevant toxic knowledge. Extensive experiments and in-depth case studies
+across multiple datasets demonstrate that our MetaTox significantly decreases
+the false positive rate while boosting overall toxicity detection performance.
+Our code will be available soon.
+
+
+
+ comment: 8 pages of content
+
+
+
+
+
+
+ ♻ ☆ L3TC: Leveraging RWKV for Learned Lossless Low-Complexity Text
+ Compression
+
+
+
+
+
+
+
+
+ Junxuan Zhang, Zhengxue Cheng, Yan Zhao, Shihao Wang, Dajiang Zhou, Guo Lu, Li Song
+
+
+ Learning-based probabilistic models can be combined with an entropy coder for
+data compression. However, due to the high complexity of learning-based models,
+their practical application as text compressors has been largely overlooked. To
+address this issue, our work focuses on a low-complexity design while
+maintaining compression performance. We introduce a novel Learned Lossless
+Low-complexity Text Compression method (L3TC). Specifically, we first conduct
+extensive experiments demonstrating that RWKV models achieve the fastest
+decoding speed with a moderate compression ratio, making them the most suitable
+backbone for our method. Second, we propose an outlier-aware tokenizer that
+uses a limited vocabulary to cover frequent tokens while allowing outliers to
+bypass the prediction and encoding. Third, we propose a novel high-rank
+reparameterization strategy that enhances the learning capability during
+training without increasing complexity during inference. Experimental results
+validate that our method achieves a 48% bit saving compared to the gzip compressor.
+Besides, L3TC offers compression performance comparable to other learned
+compressors, with a 50x reduction in model parameters. More importantly, L3TC
+is the fastest among all learned compressors, providing real-time decoding
+speeds up to megabytes per second. Our code is available at
+https://github.com/alipay/L3TC-leveraging-rwkv-for-learned-lossless-low-complexity-text-compression.git.
+
+
+
+
+
+
+
+ ♻ ☆ Adapting Whisper for Code-Switching through Encoding Refining and
+ Language-Aware Decoding
+
+
+ Code-switching (CS) automatic speech recognition (ASR) faces challenges due
+to the language confusion resulting from accents, auditory similarity, and
+seamless language switches. Adapting pre-trained multilingual models
+has shown promising performance for CS-ASR. In this paper, we adapt Whisper,
+which is a large-scale multilingual pre-trained speech recognition model, to CS
+on both the encoder and decoder sides. First, we propose an encoder refiner to
+enhance the encoder's capacity for intra-sentence switching. Second, we propose
+using two sets of language-aware adapters with different language prompt
+embeddings to achieve language-specific decoding information in each decoder
+layer. Then, a fusion module is added to fuse the language-aware decoding. The
+experimental results using the SEAME dataset show that, compared with the
+baseline model, the proposed approach achieves a relative MER reduction of 4.1%
+and 7.2% on the dev_man and dev_sge test sets, respectively, surpassing
+state-of-the-art methods. Through experiments, we found that the proposed
+method significantly improves the performance on non-native language in CS
+speech, indicating that our approach enables Whisper to better distinguish
+between the two languages.
+
+
+
+
+
+
+
+ ♻ ☆ Omni-MATH: A Universal Olympiad Level Mathematic Benchmark For Large
+ Language Models
+
+
+ Recent advancements in large language models (LLMs) have led to significant
+breakthroughs in mathematical reasoning capabilities. However, existing
+benchmarks like GSM8K or MATH are now being solved with high accuracy (e.g.,
+OpenAI o1 achieves 94.8% on the MATH dataset), indicating their inadequacy for
+truly challenging these models. To bridge this gap, we propose a comprehensive
+and challenging benchmark specifically designed to assess LLMs' mathematical
+reasoning at the Olympiad level. Unlike existing Olympiad-related benchmarks,
+our dataset focuses exclusively on mathematics and comprises a vast collection
+of 4428 competition-level problems with rigorous human annotation. These
+problems are meticulously categorized into over 33 sub-domains and span more
+than 10 distinct difficulty levels, enabling a holistic assessment of model
+performance in Olympiad-mathematical reasoning. Furthermore, we conducted an
+in-depth analysis based on this benchmark. Our experimental results show that
+even the most advanced models, OpenAI o1-mini and OpenAI o1-preview, struggle
+with highly challenging Olympiad-level problems, with 60.54% and 52.55%
+accuracy, highlighting significant challenges in Olympiad-level mathematical
+reasoning.
+
+
+
+ comment: 30 pages
+
+
+
+
+
+
+ ♻ ☆ Revisiting Jailbreaking for Large Language Models: A Representation
+ Engineering Perspective COLING 2025
+
+
+ The recent surge in jailbreaking attacks has revealed significant
+vulnerabilities in Large Language Models (LLMs) when exposed to malicious
+inputs. While various defense strategies have been proposed to mitigate these
+threats, there has been limited research into the underlying mechanisms that
+make LLMs vulnerable to such attacks. In this study, we suggest that the
+self-safeguarding capability of LLMs is linked to specific activity patterns
+within their representation space. Although these patterns have little impact
+on the semantic content of the generated text, they play a crucial role in
+shaping LLM behavior under jailbreaking attacks. Our findings demonstrate that
+these patterns can be detected with just a few pairs of contrastive queries.
+Extensive experimentation shows that the robustness of LLMs against
+jailbreaking can be manipulated by weakening or strengthening these patterns.
+Further visual analysis provides additional evidence for our conclusions and
+offers new insights into the jailbreaking phenomenon. These findings
+highlight the importance of addressing the potential misuse of open-source LLMs
+within the community.
+
+
+
+ comment: Accepted by COLING 2025
+
+
+
+
+
+
+ ♻ ☆ From Models to Microtheories: Distilling a Model's Topical Knowledge for
+ Grounded Question Answering
+
+
+
+
+
+
+
+
+ Nathaniel Weir, Bhavana Dalvi Mishra, Orion Weller, Oyvind Tafjord, Sam Hornstein, Alexander Sabol, Peter Jansen, Benjamin Van Durme, Peter Clark
+
+
+ Recent reasoning methods (e.g., chain-of-thought, entailment reasoning) help
+users understand how language models (LMs) answer a single question, but they
+do little to reveal the LM's overall understanding, or "theory," about the
+question's topic, making it still hard to trust the model. Our goal is to
+materialize such theories - here called microtheories (a linguistic analog of
+logical microtheories) - as a set of sentences encapsulating an LM's core
+knowledge about a topic. These statements systematically work together to
+entail answers to a set of questions to both engender trust and improve
+performance. Our approach is to first populate a knowledge store with
+(model-generated) sentences that entail answers to training questions and then
+distill those down to a core microtheory that is concise, general, and
+non-redundant. We show that, when added to a general corpus (e.g., Wikipedia),
+microtheories can supply critical, topical information not necessarily present
+in the corpus, improving both a model's ability to ground its answers to
+verifiable knowledge (i.e., show how answers are systematically entailed by
+documents in the corpus, fully grounding up to +8% more answers), and the
+accuracy of those grounded answers (up to +8% absolute). We also show that, in
+a human evaluation in the medical domain, our distilled microtheories contain a
+significantly higher concentration of topically critical facts than the
+non-distilled knowledge store. Finally, we show we can quantify the coverage of
+a microtheory for a topic (characterized by a dataset) using a notion of
+$p$-relevance. Together, these suggest that microtheories are an efficient
+distillation of an LM's topic-relevant knowledge, that they can usefully
+augment existing corpora, and can provide both performance gains and an
+interpretable, verifiable window into the model's knowledge of a topic.
+
+
+
+
+
+
+
+ ♻ ☆ Sim911: Towards Effective and Equitable 9-1-1 Dispatcher Training with
+ an LLM-Enabled Simulation
+
+
+
+
+
+
+
+
+ Zirong Chen, Elizabeth Chason, Noah Mladenovski, Erin Wilson, Kristin Mullen, Stephen Martini, Meiyi Ma
+
+
+ Emergency response services are vital for enhancing public safety by
+safeguarding the environment, property, and human lives. As frontline members
+of these services, 9-1-1 dispatchers have a direct impact on response times and
+the overall effectiveness of emergency operations. However, traditional
+dispatcher training methods, which rely on role-playing by experienced
+personnel, are labor-intensive, time-consuming, and often neglect the specific
+needs of underserved communities. To address these challenges, we introduce
+Sim911, the first training simulation for 9-1-1 dispatchers powered by Large
+Language Models (LLMs). Sim911 enhances training through three key technical
+innovations: (1) knowledge construction, which utilizes archived 9-1-1 call
+data to generate simulations that closely mirror real-world scenarios; (2)
+context-aware controlled generation, which employs dynamic prompts and vector
+bases to ensure that LLM behavior aligns with training objectives; and (3)
+validation with looped correction, which filters out low-quality responses and
+refines the system performance.
+
+
+
+
+
+
+
+
+ Xi Cao, Quzong Gesang, Yuan Sun, Nuo Qun, Tashi Nyima
+
+
+ Language models based on deep neural networks are vulnerable to textual
+adversarial attacks. While rich-resource languages like English are receiving
+focused attention, Tibetan, a cross-border language, is gradually being studied
+due to its abundant ancient literature and critical language strategy.
+Currently, there are several Tibetan adversarial text generation methods, but
+they do not fully consider the textual features of Tibetan script and
+overestimate the quality of generated adversarial texts. To address this issue,
+we propose a novel Tibetan adversarial text generation method called TSCheater,
+which considers the characteristics of Tibetan encoding and the feature that
+visually similar syllables have similar semantics. This method can also be
+transferred to other abugidas, such as Devanagari script. We utilize a
+self-constructed Tibetan syllable visual similarity database called TSVSDB to
+generate substitution candidates and adopt a greedy algorithm-based scoring
+mechanism to determine substitution order. After that, we evaluate the method on
+eight victim language models. Experimentally, TSCheater outperforms existing
+methods in attack effectiveness, perturbation magnitude, semantic similarity,
+visual similarity, and human acceptance. Finally, we construct the first
+Tibetan adversarial robustness evaluation benchmark called AdvTS, which is
+generated by existing methods and proofread by humans.
+
+
+
+ comment: Pre-Camera-Ready Version; Accepted at ICASSP 2025
+
+ Long-form document matching aims to judge the relevance between two documents
+and has been applied to various scenarios. Most existing works utilize
+hierarchical or long context models to process documents, which achieve coarse
+understanding but may ignore details. Some researchers construct a document
+view with similar sentences about aligned document subtopics to focus on
+detailed matching signals. However, a long document generally contains multiple
+subtopics. The matching signals from multiple topics are heterogeneous.
+Considering only the homologous aligned subtopics may not be representative
+enough and may cause biased modeling. In this paper, we introduce a new
+framework to model representative matching signals. First, we propose to
+capture various matching signals through subtopics of document pairs. Next, we
+construct multiple document views based on subtopics to cover heterogeneous and
+valuable details. However, existing spatial aggregation methods like attention,
+which integrate all these views simultaneously, struggle to integrate
+heterogeneous information. Instead, we propose temporal aggregation, which
+effectively integrates different views gradually as the training progresses.
+Experimental results show that our learning framework is effective on several
+document-matching tasks, including news duplication and legal case retrieval.
+
+
+
+
+
+
+
+ ♻ ☆ Is Parameter Collision Hindering Continual Learning in LLMs?
+
+
+ Large Language Models (LLMs) often suffer from catastrophic forgetting when
+learning multiple tasks sequentially, making continual learning (CL) essential
+for their dynamic deployment. Existing state-of-the-art (SOTA) methods, such as
+O-LoRA, typically focus on constructing orthogonality tasks to decouple
+parameter interdependence from various domains. In this paper, we reveal that
+building non-collision parameters is a more critical factor in addressing CL
+challenges. Our theoretical and experimental analyses demonstrate that
+non-collision parameters can provide better task orthogonality, which is a
+sufficient but not necessary condition. Furthermore, knowledge from multiple
+domains will be preserved in non-collision parameter subspaces, making it more
+difficult to forget previously seen data. Leveraging this insight, we propose
+Non-collision Low-Rank Adaptation (N-LoRA), a simple yet effective approach
+leveraging low collision rates to enhance CL in LLMs. Experimental results on
+multiple CL benchmarks indicate that N-LoRA achieves superior performance
+(+2.9), higher task orthogonality (4.1 times), and lower parameter collision
+(58.1 times) than SOTA methods.
+
+
+
+
+
+
+
+ ♻ ☆ MiMoTable: A Multi-scale Spreadsheet Benchmark with Meta Operations for
+ Table Reasoning COLING 2025
+
+
+
+
+
+
+
+
+ Zheng Li, Yang Du, Mao Zheng, Mingyang Song
+
+
+ Extensive research has been conducted to explore the capability of Large
+Language Models (LLMs) for table reasoning and has significantly improved the
+performance on existing benchmarks. However, tables and user questions in
+real-world applications are more complex and diverse, presenting a non-negligible
+gap compared to the existing benchmarks. To fill the gap, we propose a
+Multi-scale spreadsheet benchmark with Meta operations for Table reasoning,
+named MiMoTable.
+Specifically, MiMoTable incorporates two key features. First, the tables in
+MiMoTable are all spreadsheets used in real-world scenarios, which cover seven
+domains and contain different types. Second, we define a new criterion with six
+categories of meta operations for measuring the difficulty of each question in
+MiMoTable, which also serves as a new perspective for measuring the difficulty of
+the existing benchmarks. Experimental results show that Claude-3.5-Sonnet
+achieves the best performance with 77.4% accuracy, indicating that there is
+still significant room to improve for LLMs on MiMoTable. Furthermore, we grade
+the difficulty of existing benchmarks according to our new criterion.
+Experiments have shown that the performance of LLMs decreases as the difficulty
+of benchmarks increases, thereby proving the effectiveness of our proposed new
+criterion.
+
+
+
+ comment: Accepted by COLING 2025
+
+
+
+
+
+
+ ♻ ☆ Counting-Stars: A Multi-evidence, Position-aware, and Scalable Benchmark
+ for Evaluating Long-Context Large Language Models COLING 2025
+
+
+
+
+
+
+
+
+ Mingyang Song, Mao Zheng, Xuan Luo
+
+
+ Despite recent efforts to develop large language models with robust
+long-context capabilities, the lack of long-context benchmarks means that
+relatively little is known about their performance. To alleviate this gap, in
+this paper, we propose Counting-Stars, a multi-evidence,
+position-aware, and scalable benchmark designed to evaluate the multi-evidence
+retrieval capabilities of long-context LLMs. Counting-Stars comprises
+two counting-based multi-evidence retrieval sub-tasks: searching
+and reasoning. Using Counting-Stars, we conduct experiments to evaluate several
+long-context LLMs, including GPT-4 Turbo, Gemini 1.5 Pro, Claude3 Opus, GLM-4,
+and Moonshot-v1. Extensive experimental results demonstrate that Gemini 1.5 Pro
+achieves the best overall results, while GPT-4 Turbo exhibits the most stable
+performance across various tasks. Furthermore, our analysis of these LLMs,
+which have been extended to handle long-context scenarios, indicates that
+significant room for improvement remains as the length of the input context and
+the complexity of the tasks increase.
+
+
+ Fill-in-the-Middle (FIM) has become integral to code language models,
+enabling generation of missing code given both left and right contexts.
+However, the current FIM training paradigm, which reorders original training
+sequences and then performs regular next-token prediction (NTP), often leads to
+models struggling to generate content that aligns smoothly with the surrounding
+context. Crucially, while existing works rely on rule-based post-processing to
+circumvent this weakness, such methods are not practically usable in
+open-domain code completion tasks as they depend on restrictive,
+dataset-specific assumptions (e.g., generating the same number of lines as in
+the ground truth). Moreover, model performance on FIM tasks deteriorates
+significantly without these unrealistic assumptions.
+ We hypothesize that NTP alone is insufficient for models to learn effective
+planning conditioned on the distant right context, a critical factor for
+successful code infilling. To overcome this, we propose Horizon-Length
+Prediction (HLP), a novel training objective that teaches models to predict the
+number of remaining middle tokens (i.e., horizon length) at each step. HLP
+advances FIM with lookahead planning, enabling models to inherently learn
+infilling boundaries for arbitrary left and right contexts without relying on
+dataset-specific post-processing. Our evaluation across different models and
+sizes shows that HLP significantly improves FIM performance by up to 24%
+(relative) on diverse benchmarks, at both the file level and the repository level, and
+without resorting to unrealistic post-processing methods. Furthermore, the
+enhanced planning capability gained through HLP boosts model performance on
+code reasoning. Importantly, HLP only incurs negligible training overhead and
+no additional inference cost, ensuring its practicality for real-world
+scenarios.
+
+
+ We present an efficient encoder-free approach for video-language
+understanding that achieves competitive performance while significantly
+reducing computational overhead. Current video-language models typically rely
+on heavyweight image encoders (300M-1.1B parameters) or video encoders (1B-1.4B
+parameters), creating a substantial computational burden when processing
+multi-frame videos. Our method introduces a novel Spatio-Temporal Alignment
+Block (STAB) that directly processes video inputs without requiring pre-trained
+encoders while using only 45M parameters for visual processing - at least a
+6.5$\times$ reduction compared to traditional approaches. The STAB architecture
+combines Local Spatio-Temporal Encoding for fine-grained feature extraction,
+efficient spatial downsampling through learned attention, and separate
+mechanisms for modeling frame-level and video-level relationships. Our model
+achieves comparable or superior performance to encoder-based approaches for
+open-ended video question answering on standard benchmarks. The fine-grained
+video question-answering evaluation demonstrates our model's effectiveness,
+outperforming the encoder-based approaches Video-ChatGPT and Video-LLaVA in key
+aspects like correctness and temporal understanding. Extensive ablation studies
+validate our architectural choices and demonstrate the effectiveness of our
+spatio-temporal modeling approach while achieving 3-4$\times$ faster processing
+speeds than previous methods. Code is available at
+https://github.com/jh-yi/Video-Panda.
+
+
+
+
+
+
+
+ ☆ PartGen: Part-level 3D Generation and Reconstruction with Multi-View
+ Diffusion Models
+
+
+
+
+
+
+
+
+ Minghao Chen, Roman Shapovalov, Iro Laina, Tom Monnier, Jianyuan Wang, David Novotny, Andrea Vedaldi
+
+
+ Text- or image-to-3D generators and 3D scanners can now produce 3D assets
+with high-quality shapes and textures. These assets typically consist of a
+single, fused representation, like an implicit neural field, a Gaussian
+mixture, or a mesh, without any useful structure. However, most applications
+and creative workflows require assets to be made of several meaningful parts
+that can be manipulated independently. To address this gap, we introduce
+PartGen, a novel approach that generates 3D objects composed of meaningful
+parts starting from text, an image, or an unstructured 3D object. First, given
+multiple views of a 3D object, generated or rendered, a multi-view diffusion
+model extracts a set of plausible and view-consistent part segmentations,
+dividing the object into parts. Then, a second multi-view diffusion model takes
+each part separately, fills in the occlusions, and uses those completed views
+for 3D reconstruction by feeding them to a 3D reconstruction network. This
+completion process considers the context of the entire object to ensure that
+the parts integrate cohesively. The generative completion model can make up for
+the information missing due to occlusions; in extreme cases, it can hallucinate
+entirely invisible parts based on the input 3D asset. We evaluate our method on
+generated and real 3D assets and show that it outperforms segmentation and
+part-extraction baselines by a large margin. We also showcase downstream
+applications such as 3D part editing.
+
+
+ World model-based searching and planning are widely recognized as a promising
+path toward human-level physical intelligence. However, current driving world
+models primarily rely on video diffusion models, which specialize in visual
+generation but lack the flexibility to incorporate other modalities like
+action. In contrast, autoregressive transformers have demonstrated exceptional
+capability in modeling multimodal data. Our work aims to unify both driving
+model simulation and trajectory planning into a single sequence modeling
+problem. We introduce a multimodal driving language based on interleaved image
+and action tokens, and develop DrivingGPT to learn joint world modeling and
+planning through standard next-token prediction. Our DrivingGPT demonstrates
+strong performance in both action-conditioned video generation and end-to-end
+planning, outperforming strong baselines on large-scale nuPlan and NAVSIM
+benchmarks.
+
+
+ Orientation is a key attribute of objects, crucial for understanding their
+spatial pose and arrangement in images. However, practical solutions for
+accurate orientation estimation from a single image remain underexplored. In
+this work, we introduce Orient Anything, the first expert and foundational
+model designed to estimate object orientation in a single- and free-view image.
+Due to the scarcity of labeled data, we propose extracting knowledge from the
+3D world. By developing a pipeline to annotate the front face of 3D objects and
+render images from random views, we collect 2M images with precise orientation
+annotations. To fully leverage the dataset, we design a robust training
+objective that models the 3D orientation as probability distributions of three
+angles and predicts the object orientation by fitting these distributions.
+Besides, we employ several strategies to improve synthetic-to-real transfer.
+Our model achieves state-of-the-art orientation estimation accuracy in both
+rendered and real images and exhibits impressive zero-shot ability in various
+scenarios. More importantly, our model enhances many applications, such as
+comprehension and generation of complex spatial concepts and 3D object pose
+adjustment.
+
+
+ Classifiers are important components in many computer vision tasks, serving
+as the foundational backbone of a wide variety of models employed across
+diverse applications. However, understanding the decision-making process of
+classifiers remains a significant challenge. We propose DiffEx, a novel method
+that leverages the capabilities of text-to-image diffusion models to explain
+classifier decisions. Unlike traditional GAN-based explainability models, which
+are limited to simple, single-concept analyses and typically require training a
+new model for each classifier, our approach can explain classifiers that focus
+on single concepts (such as faces or animals) as well as those that handle
+complex scenes involving multiple concepts. DiffEx employs vision-language
+models to create a hierarchical list of semantics, allowing users to identify
+not only the overarching semantic influences on classifiers (e.g., the 'beard'
+semantic in a facial classifier) but also their sub-types, such as 'goatee' or
+'Balbo' beard. Our experiments demonstrate that DiffEx is able to cover a
+significantly broader spectrum of semantics compared to its GAN counterparts,
+providing a hierarchical tool that delivers a more detailed and fine-grained
+understanding of classifier decisions.
+
+
+
+
+
+
+
+ ☆ ZeroHSI: Zero-Shot 4D Human-Scene Interaction by Video Generation
+
+
+ Human-scene interaction (HSI) generation is crucial for applications in
+embodied AI, virtual reality, and robotics. While existing methods can
+synthesize realistic human motions in 3D scenes and generate plausible
+human-object interactions, they heavily rely on datasets containing paired 3D
+scene and motion capture data, which are expensive and time-consuming to
+collect across diverse environments and interactions. We present ZeroHSI, a
+novel approach that enables zero-shot 4D human-scene interaction synthesis by
+integrating video generation and neural human rendering. Our key insight is to
+leverage the rich motion priors learned by state-of-the-art video generation
+models, which have been trained on vast amounts of natural human movements and
+interactions, and use differentiable rendering to reconstruct human-scene
+interactions. ZeroHSI can synthesize realistic human motions in both static
+scenes and environments with dynamic objects, without requiring any
+ground-truth motion data. We evaluate ZeroHSI on a curated dataset of
+various indoor and outdoor scenes with different interaction prompts,
+demonstrating its ability to generate diverse and contextually appropriate
+human-scene interactions.
+
+
+ Sora-like video generation models have achieved remarkable progress with a
+Multi-Modal Diffusion Transformer (MM-DiT) architecture. However, the current
+video generation models predominantly focus on single prompts, struggling to
+generate coherent scenes with multiple sequential prompts that better reflect
+real-world dynamic scenarios. While some pioneering works have explored
+multi-prompt video generation, they face significant challenges including
+strict training data requirements, weak prompt following, and unnatural
+transitions. To address these problems, we propose DiTCtrl, the first
+training-free multi-prompt video generation method for MM-DiT architectures.
+Our key idea is to take the multi-prompt video generation task as
+temporal video editing with smooth transitions. To achieve this goal, we first
+analyze MM-DiT's attention mechanism, finding that the 3D full attention
+behaves similarly to that of the cross/self-attention blocks in the UNet-like
+diffusion models, enabling mask-guided precise semantic control across
+different prompts with attention sharing for multi-prompt video generation.
+Based on our careful design, the video generated by DiTCtrl achieves smooth
+transitions and consistent object motion given multiple sequential prompts
+without additional training. We also present MPVBench, a new benchmark
+specially designed to evaluate the performance of multi-prompt video
+generation. Extensive experiments demonstrate that
+our method achieves state-of-the-art performance without additional training.
+
+
+
+
+
+
+
+
+ Kanchana Ranasinghe, Sadeep Jayasumana, Andreas Veit, Ayan Chakrabarti, Daniel Glasner, Michael S Ryoo, Srikumar Ramalingam, Sanjiv Kumar
+
+
+ Latent Diffusion Models (LDMs) produce high-quality, photo-realistic images;
+however, the latency incurred by multiple costly inference iterations can
+restrict their applicability. We introduce LatentCRF, a continuous Conditional
+Random Field (CRF) model, implemented as a neural network layer, that models
+the spatial and semantic relationships among the latent vectors in the LDM. By
+replacing some of the computationally-intensive LDM inference iterations with
+our lightweight LatentCRF, we achieve a superior balance between quality, speed
+and diversity. We increase inference efficiency by 33% with no loss in image
+quality or diversity compared to the full LDM. LatentCRF is an easy add-on,
+which does not require modifying the LDM.
+
+
+
+
+
+
+
+ ☆ ClassifyViStA: WCE Classification with Visual understanding through
+ Segmentation and Attention
+
+
+
+
+
+
+
+
+ S. Balasubramanian, Ammu Abhishek, Yedu Krishna, Darshan Gera
+
+
+ Gastrointestinal (GI) bleeding is a serious medical condition that presents
+significant diagnostic challenges, particularly in settings with limited access
+to healthcare resources. Wireless Capsule Endoscopy (WCE) has emerged as a
+powerful diagnostic tool for visualizing the GI tract, but it requires
+time-consuming manual analysis by experienced gastroenterologists, which is
+prone to human error and inefficient given the increasing number of patients. To
+address this challenge, we propose ClassifyViStA, an AI-based framework
+designed for the automated detection and classification of bleeding and
+non-bleeding frames from WCE videos. The model consists of a standard
+classification path, augmented by two specialized branches: an implicit
+attention branch and a segmentation branch. The attention branch focuses on the
+bleeding regions, while the segmentation branch generates accurate segmentation
+masks, which are used for classification and interpretability. The model is
+built upon an ensemble of ResNet18 and VGG16 architectures to enhance
+classification performance. For the bleeding region detection, we implement a
+Soft Non-Maximum Suppression (Soft NMS) approach with YOLOv8, which improves
+the handling of overlapping bounding boxes, resulting in more accurate and
+nuanced detections. The system's interpretability is enhanced by using the
+segmentation masks to explain the classification results, offering insights
+into the decision-making process similar to the way a gastroenterologist
+identifies bleeding regions. Our approach not only automates the detection of
+GI bleeding but also provides an interpretable solution that can ease the
+burden on healthcare professionals and improve diagnostic efficiency. Our code
+is available at ClassifyViStA.
+
+
+
+
+
+
+
+ ☆ Text-Driven Tumor Synthesis
+
+
+
+
+
+
+
+
+ Xinran Li, Yi Shuai, Chen Liu, Qi Chen, Qilong Wu, Pengfei Guo, Dong Yang, Can Zhao, Pedro R. A. S. Bassi, Daguang Xu, Kang Wang, Yang Yang, Alan Yuille, Zongwei Zhou
+
+
+ Tumor synthesis can generate examples that AI often misses or over-detects,
+improving AI performance by training on these challenging cases. However,
+existing synthesis methods, which are typically unconditional -- generating
+images from random variables -- or conditioned only by tumor shapes, lack
+controllability over specific tumor characteristics such as texture,
+heterogeneity, boundaries, and pathology type. As a result, the generated
+tumors may be overly similar or duplicates of existing training data, failing
+to effectively address AI's weaknesses. We propose a new text-driven tumor
+synthesis approach, termed TextoMorph, that provides textual control over tumor
+characteristics. This is particularly beneficial for examples that confuse the
+AI the most, such as early tumor detection (increasing Sensitivity by +8.5%),
+tumor segmentation for precise radiotherapy (increasing DSC by +6.3%), and
+classification between benign and malignant tumors (improving Sensitivity by
++8.2%). By incorporating text mined from radiology reports into the synthesis
+process, we increase the variability and controllability of the synthetic
+tumors to target AI's failure cases more precisely. Moreover, TextoMorph uses
+contrastive learning across different texts and CT scans, significantly
+reducing dependence on scarce image-report pairs (only 141 pairs used in this
+study) by leveraging a large corpus of 34,035 radiology reports. Finally, we
+have developed rigorous tests to evaluate synthetic tumors, including
+Text-Driven Visual Turing Test and Radiomics Pattern Analysis, showing that our
+synthetic tumors are realistic and diverse in texture, heterogeneity,
+boundaries, and pathology.
+
+
+
+
+
+
+
+ ☆ Resolution-Robust 3D MRI Reconstruction with 2D Diffusion Priors:
+ Diverse-Resolution Training Outperforms Interpolation
+
+
+ Deep learning-based 3D imaging, in particular magnetic resonance imaging
+(MRI), is challenging because of limited availability of 3D training data.
+Therefore, 2D diffusion models trained on 2D slices are starting to be
+leveraged for 3D MRI reconstruction. However, as we show in this paper,
+existing methods pertain to a fixed voxel size, and performance degrades when
+the voxel size is varied, as is often the case in clinical practice. In this
+paper, we propose and study several approaches for resolution-robust 3D MRI
+reconstruction with 2D diffusion priors. As a result of this investigation, we
+obtain a simple resolution-robust variational 3D reconstruction approach based
+on diffusion-guided regularization of randomly sampled 2D slices. This method
+provides competitive reconstruction quality compared to posterior sampling
+baselines. Towards resolving the sensitivity to resolution shifts, we
+investigate state-of-the-art model-based approaches including Gaussian
+splatting, neural representations, and infinite-dimensional diffusion models,
+as well as a simple data-centric approach of training the diffusion model on
+several resolutions. Our experiments demonstrate that the model-based
+approaches fail to close the performance gap in 3D MRI. In contrast, the
+data-centric approach of training the diffusion model on various resolutions
+effectively provides a resolution-robust method without compromising accuracy.
+
+
+
+
+
+
+
+ ☆ 3DEnhancer: Consistent Multi-View Diffusion for 3D Enhancement
+
+
+ Despite advances in neural rendering, due to the scarcity of high-quality 3D
+datasets and the inherent limitations of multi-view diffusion models, view
+synthesis and 3D model generation are restricted to low resolutions with
+suboptimal multi-view consistency. In this study, we present a novel 3D
+enhancement pipeline, dubbed 3DEnhancer, which employs a multi-view latent
+diffusion model to enhance coarse 3D inputs while preserving multi-view
+consistency. Our method includes a pose-aware encoder and a diffusion-based
+denoiser to refine low-quality multi-view images, along with data augmentation
+and a multi-view attention module with epipolar aggregation to maintain
+consistent, high-quality 3D outputs across views. Unlike existing video-based
+approaches, our model supports seamless multi-view enhancement with improved
+coherence across diverse viewing angles. Extensive evaluations show that
+3DEnhancer significantly outperforms existing methods, boosting both multi-view
+enhancement and per-instance 3D optimization tasks.
+
+
+
+
+
+
+
+ ☆ Advancing Deformable Medical Image Registration with Multi-axis
+ Cross-covariance Attention
+
+
+
+
+
+
+
+
+ Mingyuan Meng, Michael Fulham, Lei Bi, Jinman Kim
+
+
+ Deformable image registration is a fundamental requirement for medical image
+analysis. Recently, transformers have been widely used in deep learning-based
+registration methods for their ability to capture long-range dependency via
+self-attention (SA). However, the high computation and memory loads of SA
+(growing quadratically with the spatial resolution) hinder transformers from
+processing subtle textural information in high-resolution image features, e.g.,
+at the full and half image resolutions. This limits deformable registration as
+the high-resolution textural information is crucial for finding precise
+pixel-wise correspondence between subtle anatomical structures.
+Cross-covariance Attention (XCA), as a "transposed" version of SA that operates
+across feature channels, has complexity growing linearly with the spatial
+resolution, providing the feasibility of capturing long-range dependency among
+high-resolution image features. However, existing XCA-based transformers merely
+capture coarse global long-range dependency, which is unsuitable for
+deformable image registration relying primarily on fine-grained local
+correspondence. In this study, we propose to improve existing deep
+learning-based registration methods by embedding a new XCA mechanism. To this
+end, we design an XCA-based transformer block optimized for deformable medical
+image registration, named Multi-Axis XCA (MAXCA). Our MAXCA serves as a general
+network block that can be embedded into various registration network
+architectures. It can capture both global and local long-range dependency among
+high-resolution image features by applying regional and dilated XCA in parallel
+via a multi-axis design. Extensive experiments on two well-benchmarked
+inter-/intra-patient registration tasks with seven public medical datasets
+demonstrate that our MAXCA block enables state-of-the-art registration
+performance.
+
+
+
+ comment: Under Review
+
+
+
+
+
+
+ ☆ The Key of Understanding Vision Tasks: Explanatory Instructions
+
+
+ Computer Vision (CV) has yet to fully achieve the zero-shot task
+generalization observed in Natural Language Processing (NLP), despite following
+many of the milestones established in NLP, such as large transformer models,
+extensive pre-training, and the auto-regression paradigm, among others. In this
+paper, we explore the idea that CV adopts discrete and terminological task
+definitions (e.g., ``image segmentation''), which may be a key barrier to
+zero-shot task generalization. Our hypothesis is that without truly
+understanding previously-seen tasks--due to these terminological
+definitions--deep models struggle to generalize to novel tasks. To verify this,
+we introduce Explanatory Instructions, which provide an intuitive way to define
+CV task objectives through detailed linguistic transformations from input
+images to outputs. We create a large-scale dataset comprising 12 million
+``image input $\to$ explanatory instruction $\to$ output'' triplets, and train
+an auto-regressive-based vision-language model (AR-based VLM) that takes both
+images and explanatory instructions as input. By learning to follow these
+instructions, the AR-based VLM achieves instruction-level zero-shot
+capabilities for previously-seen tasks and demonstrates strong zero-shot
+generalization for unseen CV tasks. Code and dataset will be openly available
+on our GitHub repository.
+
+
+
+ comment: 40 pages
+
+
+
+
+
+
+ ☆ HTR-JAND: Handwritten Text Recognition with Joint Attention Network and
+ Knowledge Distillation
+
+
+ Despite significant advances in deep learning, current Handwritten Text
+Recognition (HTR) systems struggle with the inherent complexity of historical
+documents, including diverse writing styles, degraded text quality, and
+computational efficiency requirements across multiple languages and time
+periods. This paper introduces HTR-JAND (Handwritten Text Recognition
+with Joint Attention Network and Knowledge Distillation), an efficient HTR
+framework that combines advanced feature extraction with knowledge
+distillation. Our architecture incorporates three key components: (1) a CNN
+architecture integrating FullGatedConv2d layers with Squeeze-and-Excitation
+blocks for adaptive feature extraction, (2) a Combined Attention mechanism
+fusing Multi-Head Self-Attention with Proxima Attention for robust sequence
+modeling, and (3) a Knowledge Distillation framework enabling efficient model
+compression while preserving accuracy through curriculum-based training. The
+HTR-JAND framework implements a multi-stage training approach combining
+curriculum learning, synthetic data generation, and multi-task learning for
+cross-dataset knowledge transfer. We enhance recognition accuracy through
+context-aware T5 post-processing, particularly effective for historical
+documents. Comprehensive evaluations demonstrate HTR-JAND's effectiveness,
+achieving state-of-the-art Character Error Rates (CER) of 1.23%, 1.02%, and
+2.02% on IAM, RIMES, and Bentham datasets respectively. Our Student model
+achieves a 48% parameter reduction (0.75M versus 1.5M parameters) while
+maintaining competitive performance through efficient knowledge transfer.
+Source code and pre-trained models are available at
+https://github.com/DocumentRecognitionModels/HTR-JAND.
+
+
+
+
+
+
+
+ ☆ VORTEX: A Spatial Computing Framework for Optimized Drone Telemetry
+ Extraction from First-Person View Flight Data
+
+
+
+
+
+
+
+
+ James E. Gallagher, Edward J. Oughton
+
+
+ This paper presents the Visual Optical Recognition Telemetry EXtraction
+(VORTEX) system for extracting and analyzing drone telemetry data from First
+Person View (FPV) Uncrewed Aerial System (UAS) footage. VORTEX employs MMOCR, a
+PyTorch-based Optical Character Recognition (OCR) toolbox, to extract telemetry
+variables from drone Heads Up Display (HUD) recordings, utilizing advanced
+image preprocessing techniques, including CLAHE enhancement and adaptive
+thresholding. The study optimizes spatial accuracy and computational efficiency
+through systematic investigation of temporal sampling rates (1s, 5s, 10s, 15s,
+20s) and coordinate processing methods. Results demonstrate that the 5-second
+sampling rate, utilizing 4.07% of available frames, provides the optimal
+balance with a point retention rate of 64% and mean speed accuracy within 4.2%
+of the 1-second baseline while reducing computational overhead by 80.5%.
+Comparative analysis of coordinate processing methods reveals that while UTM
+Zone 33N projection and Haversine calculations provide consistently similar
+results (within 0.1% difference), raw WGS84 coordinates underestimate distances
+by 15-30% and speeds by 20-35%. Altitude measurements showed unexpected
+resilience to sampling rate variations, with only 2.1% variation across all
+intervals. This research is the first of its kind, providing quantitative
+benchmarks for establishing a robust framework for drone telemetry extraction
+and analysis using open-source tools and spatial libraries.
+
+
+
+
+
+
+
+ ☆ A region-wide, multi-year set of crop field boundary labels for Africa
+
+
+
+
+
+
+
+
+ L. D. Estes, A. Wussah, M. Asipunu, M. Gathigi, P. Kovačič, J. Muhando, B. V. Yeboah, F. K. Addai, E. S. Akakpo, M. K. Allotey, P. Amkoya, E. Amponsem, K. D. Donkoh, N. Ha, E. Heltzel, C. Juma, R. Mdawida, A. Miroyo, J. Mucha, J. Mugami, F. Mwawaza, D. A. Nyarko, P. Oduor, K. N. Ohemeng, S. I. D. Segbefia, T. Tumbula, F. Wambua, G. H. Xeflide, S. Ye, F. Yeboah
+
+
+ African agriculture is undergoing rapid transformation. Annual maps of crop
+fields are key to understanding the nature of this transformation, but such
+maps are currently lacking and must be developed using advanced machine
+learning models trained on high resolution remote sensing imagery. To enable
+the development of such models, we delineated field boundaries in 33,746 Planet
+images captured between 2017 and 2023 across the continent using a custom
+labeling platform with built-in procedures for assessing and mitigating label
+error. We collected 42,403 labels, including 7,204 labels arising from tasks
+dedicated to assessing label quality (Class 1 labels), 32,167 from sites mapped
+once by a single labeller (Class 2) and 3,032 labels from sites where 3 or more
+labellers were tasked to map the same location (Class 4). Class 1 labels were
+used to calculate labeller-specific quality scores, while Class 1 and 4 sites
+mapped by at least 3 labellers were used to further evaluate label uncertainty
+using a Bayesian risk metric. Quality metrics showed that label quality was
+moderately high (0.75) for measures of total field extent, but low regarding
+the number of individual fields delineated (0.33), and the position of field
+edges (0.05). These values are expected when delineating small-scale fields in
+3-5 m resolution imagery, which can be too coarse to reliably distinguish
+smaller fields, particularly in dense croplands, and therefore requires
+substantial labeller judgement. Nevertheless, previous work shows that such
+labels can train effective field mapping models. Furthermore, this large,
+probabilistic sample on its own provides valuable insight into regional
+agricultural characteristics, highlighting variations in the median field size
+and density. The imagery and vectorized labels along with quality information
+are available for download from two public repositories.
+
+
+
+ comment: 22 pages, 8 figures
+
+
+
+
+
+
+ ☆ Underwater Image Restoration via Polymorphic Large Kernel CNNs ICASSP2025
+
+
+ Underwater Image Restoration (UIR) remains a challenging task in computer
+vision due to the complex degradation of images in underwater environments.
+While recent approaches have leveraged various deep learning techniques,
+including Transformers and complex, parameter-heavy models to achieve
+significant improvements in restoration effects, we demonstrate that pure CNN
+architectures with lightweight parameters can achieve comparable results. In
+this paper, we introduce UIR-PolyKernel, a novel method for underwater image
+restoration that leverages Polymorphic Large Kernel CNNs. Our approach uniquely
+combines large kernel convolutions of diverse sizes and shapes to effectively
+capture long-range dependencies within underwater imagery. Additionally, we
+introduce a Hybrid Domain Attention module that integrates frequency and
+spatial domain attention mechanisms to enhance feature importance. By
+leveraging the frequency domain, we can capture hidden features that may not be
+perceptible to humans but are crucial for identifying patterns in both
+underwater and on-air images. This approach enhances the generalization and
+robustness of our UIR model. Extensive experiments on benchmark datasets
+demonstrate that UIR-PolyKernel achieves state-of-the-art performance in
+underwater image restoration tasks, both quantitatively and qualitatively. Our
+results show that well-designed pure CNN architectures can effectively compete
+with more complex models, offering a balance between performance and
+computational efficiency. This work provides new insights into the potential of
+CNN-based approaches for challenging image restoration tasks in underwater
+environments. The code is available at
+https://github.com/CXH-Research/UIR-PolyKernel.
+
+
+
+ comment: Accepted by ICASSP2025
+
+
+
+
+
+
+ ☆ 3DGraphLLM: Combining Semantic Graphs and Large Language Models for 3D
+ Scene Understanding
+
+
+ A 3D scene graph represents a compact scene model, storing information about
+the objects and the semantic relationships between them, making its use
+promising for robotic tasks. When interacting with a user, an embodied
+intelligent agent should be capable of responding to various queries about the
+scene formulated in natural language. Large Language Models (LLMs) are
+beneficial solutions for user-robot interaction due to their natural language
+understanding and reasoning abilities. Recent methods for creating learnable
+representations of 3D scenes have demonstrated the potential to improve the
+quality of LLM responses by adapting to the 3D world. However, the existing
+methods do not explicitly utilize information about the semantic relationships
+between objects, limiting themselves to information about their coordinates. In
+this work, we propose 3DGraphLLM, a method for constructing a learnable
+representation of a 3D scene graph. The learnable representation is used as
+input for LLMs to perform 3D vision-language tasks. In our experiments on
+popular ScanRefer, RIORefer, Multi3DRefer, ScanQA, Sqa3D, and Scan2cap
+datasets, we demonstrate the advantage of this approach over baseline methods
+that do not use information about the semantic relationships between objects.
+The code is publicly available at
+https://github.com/CognitiveAISystems/3DGraphLLM.
+
+
+ Image generation in the fashion domain has predominantly focused on
+preserving body characteristics or following input prompts, but little
+attention has been paid to improving the inherent fashionability of the output
+images. This paper presents a novel diffusion model-based approach that
+generates fashion images with improved fashionability while maintaining control
+over key attributes. Key components of our method include: 1) fashionability
+enhancement, which ensures that the generated images are more fashionable than
+the input; 2) preservation of body characteristics, encouraging the generated
+images to maintain the original shape and proportions of the input; and 3)
+automatic fashion optimization, which does not rely on manual input or external
+prompts. We also employ two methods to collect training data for guidance while
+generating and evaluating the images. In particular, we rate outfit images
+using fashionability scores annotated by multiple fashion experts through
+OpenSkill-based and five critical aspect-based pairwise comparisons. These
+methods provide complementary perspectives for assessing and improving the
+fashionability of the generated images. The experimental results show that our
+approach outperforms the baseline Fashion++ in generating images with superior
+fashionability, demonstrating its effectiveness in producing more stylish and
+appealing fashion images.
+
+
+ The growing field of remote sensing faces a challenge: the ever-increasing
+size and volume of imagery data are exceeding the storage and transmission
+capabilities of satellite platforms. Efficient compression of remote sensing
+imagery is a critical solution to alleviate these burdens on satellites.
+However, existing compression methods are often too computationally expensive
+for satellites. With the continued advancement of compressed sensing theory,
+single-pixel imaging emerges as a powerful tool that brings new possibilities
+for on-orbit image compression. However, it still suffers from prolonged
+imaging times and the inability to perform high-resolution imaging, hindering
+its practical application. This paper advances the study of compressed sensing
+in remote sensing image compression, proposing Block Modulated Imaging (BMI).
+By requiring only a single exposure, BMI significantly enhances imaging
+acquisition speeds. Additionally, BMI obviates the need for digital micromirror
+devices and surpasses limitations in image resolution. Furthermore, we propose
+a novel decoding network specifically designed to reconstruct images compressed
+under the BMI framework. Leveraging the gated 3D convolutions and promoting
+efficient information flow across stages through a Two-Way Cross-Attention
+module, our decoding network exhibits demonstrably superior reconstruction
+performance. Extensive experiments conducted on multiple renowned remote
+sensing datasets unequivocally demonstrate the efficacy of our proposed method.
+To further validate its practical applicability, we developed and tested a
+prototype of the BMI-based camera, which has shown promising potential for
+on-orbit image compression. The code is available at
+https://github.com/Johnathan218/BMNet.
+
+
+
+
+
+
+
+ ☆ Re-assessing ImageNet: How aligned is its single-label assumption with
+ its multi-label nature?
+
+
+
+
+
+
+
+
+ Esla Timothy Anzaku, Seyed Amir Mousavi, Arnout Van Messem, Wesley De Neve
+
+
+ ImageNet, an influential dataset in computer vision, is traditionally
+evaluated using single-label classification, which assumes that an image can be
+adequately described by a single concept or label. However, this approach may
+not fully capture the complex semantics within the images available in
+ImageNet, potentially hindering the development of models that effectively
+learn these intricacies. This study critically examines the prevalent
+single-label benchmarking approach and advocates for a shift to multi-label
+benchmarking for ImageNet. This shift would enable a more comprehensive
+assessment of the capabilities of deep neural network (DNN) models. We analyze
+the effectiveness of pre-trained state-of-the-art DNNs on ImageNet and one of
+its variants, ImageNetV2. Studies in the literature have reported unexpected
+accuracy drops of 11% to 14% on ImageNetV2. Our findings show that these
+reported declines are largely attributable to a characteristic of the dataset
+that has not received sufficient attention -- the proportion of images with
+multiple labels. Taking this characteristic into account, the results of our
+experiments provide evidence that there is no substantial degradation in
+effectiveness on ImageNetV2. Furthermore, we acknowledge that ImageNet
+pre-trained models exhibit some capability at capturing the multi-label nature
+of the dataset even though they were trained under the single-label assumption.
+Consequently, we propose a new evaluation approach to augment existing
+approaches that assess this capability. Our findings highlight the importance
+of considering the multi-label nature of the ImageNet dataset during
+benchmarking. Failing to do so could lead to incorrect conclusions regarding
+the effectiveness of DNNs and divert research efforts from addressing other
+substantial challenges related to the reliability and robustness of these
+models.
+
+
+ Mechanobiology is gaining more and more traction as the fundamental role of
+physical forces in biological function becomes clearer. Forces at the
+microscale are often measured indirectly using inverse problems such as
+Traction Force Microscopy because biological experiments are hard to access
+with physical probes. In contrast with the experimental nature of biology and
+physics, these measurements do not come with error bars, confidence regions, or
+p-values. The aim of this manuscript is to publicize this issue and to propose
+a first step towards a remedy in the form of a general reconstruction framework
+that enables hypothesis testing.
+
+
+ Recent vision-language foundation models still frequently produce outputs
+misaligned with their inputs, evidenced by object hallucination in captioning
+and prompt misalignment in text-to-image generation models. Recent studies
+have explored methods for identifying misaligned elements, aiming not only to
+enhance interpretability but also to improve model performance. However,
+current approaches primarily rely on large foundation models in a zero-shot
+manner or fine-tuned models with human annotations, which limits scalability
+due to significant computational costs. This work proposes a novel approach,
+dubbed CLIP4DM, for detecting dense misalignments from pre-trained CLIP,
+specifically focusing on pinpointing misaligned words between image and text.
+We carefully revamp the gradient-based attribution computation method, enabling
+negative gradient of individual text tokens to indicate misalignment. We also
+propose F-CLIPScore, which aggregates misaligned attributions with a global
+alignment score. We evaluate our method on various dense misalignment detection
+benchmarks, covering various image and text domains and misalignment types. Our
+method demonstrates state-of-the-art performance among zero-shot models and
+competitive performance with fine-tuned models while maintaining superior
+efficiency. Our qualitative examples show that our method has a unique strength
+to detect entity-level objects, intangible objects, and attributes that cannot
+be easily detected by existing works. We conduct ablation studies and analyses
+to highlight the strengths and limitations of our approach. Our code is
+publicly available at https://github.com/naver-ai/CLIP4DM.
+
+
+ Diffusion Probabilistic Models (DPMs) have emerged as the de facto approach
+for high-fidelity image synthesis, operating diffusion processes on continuous
+VAE latents, which differ significantly from the text generation methods
+employed by Large Language Models (LLMs). In this paper, we introduce a novel
+generative framework, the Recurrent Diffusion Probabilistic Model (RDPM), which
+enhances the diffusion process through a recurrent token prediction mechanism,
+thereby pioneering the field of Discrete Diffusion. By progressively
+introducing Gaussian noise into the latent representations of images and
+encoding them into vector-quantized tokens in a recurrent manner, RDPM
+facilitates a unique diffusion process on discrete-value domains. This process
+iteratively predicts the token codes for subsequent timesteps, transforming the
+initial standard Gaussian noise into the source data distribution, aligning
+with GPT-style models in terms of the loss function. RDPM demonstrates superior
+performance while benefiting from the speed advantage of requiring only a few
+inference steps. This model not only leverages the diffusion process to ensure
+high-quality generation but also converts continuous signals into a series of
+high-fidelity discrete tokens, thereby maintaining a unified optimization
+strategy with other discrete tokens, such as text. We anticipate that this work
+will contribute to the development of a unified model for multimodal
+generation, specifically by integrating continuous signal domains such as
+images, videos, and audio with text. We will release the code and model weights
+to the open-source community.
+
+
+ We introduce Switch-a-View, a model that learns to automatically select the
+viewpoint to display at each timepoint when creating a how-to video. The key
+insight of our approach is how to train such a model from unlabeled--but
+human-edited--video samples. We pose a pretext task that pseudo-labels segments
+in the training videos for their primary viewpoint (egocentric or exocentric),
+and then discovers the patterns between those view-switch moments on the one
+hand and the visual and spoken content in the how-to video on the other hand.
+Armed with this predictor, our model then takes an unseen multi-view video as
+input and orchestrates which viewpoint should be displayed when. We further
+introduce a few-shot training setting that permits steering the model towards a
+new data domain. We demonstrate our idea on a variety of real-world videos from
+HowTo100M and Ego-Exo4D and rigorously validate its advantages.
+
+
+ This study presents RSGaussian, an innovative novel view synthesis (NVS)
+method for aerial remote sensing scenes that incorporates LiDAR point clouds as
+constraints into the 3D Gaussian Splatting method, which ensures that Gaussians
+grow and split along geometric benchmarks, addressing the overgrowth and
+floater issues that occur. Additionally, the approach introduces coordinate
+transformations with distortion parameters for camera models to achieve
+pixel-level alignment between LiDAR point clouds and 2D images, facilitating
+heterogeneous data fusion and achieving the high-precision geo-alignment
+required in aerial remote sensing. Depth and plane consistency losses are
+incorporated into the loss function to guide Gaussians towards real depth and
+plane representations, significantly improving depth estimation accuracy.
+Experimental results indicate that our approach has achieved novel view
+synthesis that balances photo-realistic visual quality and high-precision
+geometric estimation under aerial remote sensing datasets. Finally, we have
+also established and open-sourced a dense LiDAR point cloud dataset along with
+its corresponding aerial multi-view images, AIR-LONGYAN.
+
+
+
+
+
+
+
+ ☆ Addressing Spatial-Temporal Data Heterogeneity in Federated Continual
+ Learning via Tail Anchor
+
+
+ Federated continual learning (FCL) allows each client to continually update
+its knowledge from task streams, enhancing the applicability of federated
+learning in real-world scenarios. However, FCL needs to address not only
+spatial data heterogeneity between clients but also temporal data heterogeneity
+between tasks. In this paper, empirical experiments demonstrate that such
+input-level heterogeneity significantly affects the model's internal parameters
+and outputs, leading to severe spatial-temporal catastrophic forgetting of
+local and previous knowledge. To this end, we propose Federated Tail Anchor
+(FedTA) to mix trainable Tail Anchor with the frozen output features to adjust
+their position in the feature space, thereby overcoming parameter-forgetting
+and output-forgetting. Moreover, three novel components are also included in
+FedTA: Input Enhancement for improving the performance of pre-trained models on
+downstream tasks; Selective Input Knowledge Fusion for fusion of heterogeneous
+local knowledge on the server side; and Best Global Prototype Selection for
+finding the best anchor point for each class in the feature space. Extensive
+experiments demonstrate that FedTA not only outperforms existing FCL methods
+but also effectively preserves the relative positions of features, remaining
+unaffected by spatial and temporal changes.
+
+
+
+
+
+
+
+
+ Kunyu Peng, Di Wen, Sarfraz M. Saquib, Yufan Chen, Junwei Zheng, David Schneider, Kailun Yang, Jiamin Wu, Alina Roitberg, Rainer Stiefelhagen
+
+
+ Open-Set Domain Generalization (OSDG) is a challenging task requiring models
+to accurately predict familiar categories while minimizing confidence for
+unknown categories to effectively reject them in unseen domains. While the OSDG
+field has seen considerable advancements, the impact of label noise--a common
+issue in real-world datasets--has been largely overlooked. Label noise can
+mislead model optimization, thereby exacerbating the challenges of open-set
+recognition in novel domains. In this study, we take the first step towards
+addressing Open-Set Domain Generalization under Noisy Labels (OSDG-NL) by
+constructing dedicated benchmarks derived from widely used OSDG datasets,
+including PACS and DigitsDG. We evaluate baseline approaches by integrating
+techniques from both label denoising and OSDG methodologies, highlighting the
+limitations of existing strategies in handling label noise effectively. To
+address these limitations, we propose HyProMeta, a novel framework that
+integrates hyperbolic category prototypes for label noise-aware meta-learning
+alongside a learnable new-category agnostic prompt designed to enhance
+generalization to unseen classes. Our extensive experiments demonstrate the
+superior performance of HyProMeta compared to state-of-the-art methods across
+the newly established benchmarks. The source code of this work is released at
+https://github.com/KPeng9510/HyProMeta.
+
+
+
+ comment: The source code of this work is released at
+ https://github.com/KPeng9510/HyProMeta
+
+ Humans naturally rely on floor plans to navigate in unfamiliar environments,
+as they are readily available, reliable, and provide rich geometrical guidance.
+However, existing visual navigation settings overlook this valuable prior
+knowledge, leading to limited efficiency and accuracy. To eliminate this gap,
+we introduce a novel navigation task: Floor Plan Visual Navigation (FloNa), the
+first attempt to incorporate floor plan into embodied visual navigation. While
+the floor plan offers significant advantages, two key challenges emerge: (1)
+handling the spatial inconsistency between the floor plan and the actual scene
+layout for collision-free navigation, and (2) aligning observed images with the
+floor plan sketch despite their distinct modalities. To address these
+challenges, we propose FloDiff, a novel diffusion policy framework
+incorporating a localization module to facilitate alignment between the current
+observation and the floor plan. We further collect $20k$ navigation episodes
+across $117$ scenes in the iGibson simulator to support the training and
+evaluation. Extensive experiments demonstrate the effectiveness and efficiency
+of our framework in unfamiliar scenes using floor plan knowledge. Project
+website: https://gauleejx.github.io/flona/.
+
+
+
+ comment: Accepted by AAAI 2025
+
+
+
+
+
+
+ ☆ HAUR: Human Annotation Understanding and Recognition Through Text-Heavy
+ Images
+
+
+ Vision Question Answering (VQA) tasks use images to convey critical
+information to answer text-based questions, which is one of the most common
+forms of question answering in real-world scenarios. Numerous vision-text
+models exist today and have performed well on certain VQA tasks. However, these
+models exhibit significant limitations in understanding human annotations on
+text-heavy images. To address this, we propose the Human Annotation
+Understanding and Recognition (HAUR) task. As part of this effort, we introduce
+the Human Annotation Understanding and Recognition-5 (HAUR-5) dataset, which
+encompasses five common types of human annotations. Additionally, we developed
+and trained our model, OCR-Mix. Through comprehensive cross-model comparisons,
+our results demonstrate that OCR-Mix outperforms other models in this task. Our
+dataset and model will be released soon.
+
+
+ This study mainly explores the application of natural gesture recognition
+based on computer vision in human-computer interaction, aiming to improve the
+fluency and naturalness of human-computer interaction through gesture
+recognition technology. In the fields of virtual reality, augmented reality and
+smart home, traditional input methods have gradually failed to meet the needs
+of users for interactive experience. As an intuitive and convenient interaction
+method, gestures have received more and more attention. This paper proposes a
+gesture recognition method based on a three-dimensional hand skeleton model. By
+simulating the three-dimensional spatial distribution of hand joints, a
+simplified hand skeleton structure is constructed. By connecting the palm and
+each finger joint, a dynamic and static gesture model of the hand is formed,
+which further improves the accuracy and efficiency of gesture recognition.
+Experimental results show that this method can effectively recognize various
+gestures and maintain high recognition accuracy and real-time response
+capabilities in different environments. In addition, combined with multimodal
+technologies such as eye tracking, the intelligence level of the gesture
+recognition system can be further improved, bringing a richer and more
+intuitive user experience. In the future, with the continuous development of
+computer vision, deep learning and multimodal interaction technology, natural
+interaction based on gestures will play an important role in a wider range of
+application scenarios and promote revolutionary progress in human-computer
+interaction.
+
+
+
+
+
+
+
+ ☆ Mulberry: Empowering MLLM with o1-like Reasoning and Reflection via
+ Collective Monte Carlo Tree Search
+
+
+ In this work, we aim to develop an MLLM that understands and solves questions
+by learning to create each intermediate step of the reasoning involved till the
+final answer. To this end, we propose Collective Monte Carlo Tree Search
+(CoMCTS), a new learning-to-reason method for MLLMs, which introduces the
+concept of collective learning into ``tree search'' for effective and efficient
+reasoning-path searching and learning. The core idea of CoMCTS is to leverage
+collective knowledge from multiple models to collaboratively conjecture, search
+and identify effective reasoning paths toward correct answers via four
+iterative operations including Expansion, Simulation and Error Positioning,
+Backpropagation, and Selection. Using CoMCTS, we construct Mulberry-260k, a
+multimodal dataset with a tree of rich, explicit and well-defined reasoning
+nodes for each question. With Mulberry-260k, we perform collective SFT to train
+our model, Mulberry, a series of MLLMs with o1-like step-by-step Reasoning and
+Reflection capabilities. Extensive experiments demonstrate the superiority of
+our proposed methods on various benchmarks. Code will be available at
+https://github.com/HJYao00/Mulberry
+
+
+
+ comment: Technical report
+
+
+
+
+
+
+ ☆ Efficient and Context-Aware Label Propagation for Zero-/Few-Shot
+ Training-Free Adaptation of Vision-Language Model
+
+
+
+
+
+
+
+
+ Yushu Li, Yongyi Su, Adam Goodge, Kui Jia, Xun Xu
+
+
+ Vision-language models (VLMs) have revolutionized machine learning by
+leveraging large pre-trained models to tackle various downstream tasks. Despite
+improvements in label, training, and data efficiency, many state-of-the-art
+VLMs still require task-specific hyperparameter tuning and fail to fully
+exploit test samples. To overcome these challenges, we propose a graph-based
+approach for label-efficient adaptation and inference. Our method dynamically
+constructs a graph over text prompts, few-shot examples, and test samples,
+using label propagation for inference without task-specific tuning. Unlike
+existing zero-shot label propagation techniques, our approach requires no
+additional unlabeled support set and effectively leverages the test sample
+manifold through dynamic graph expansion. We further introduce a context-aware
+feature re-weighting mechanism to improve task adaptation accuracy.
+Additionally, our method supports efficient graph expansion, enabling real-time
+inductive inference. Extensive evaluations on downstream tasks, such as
+fine-grained categorization and out-of-distribution generalization, demonstrate
+the effectiveness of our approach.
+
+
+
+
+
+
+
+
+ Jaechul Roh, Andrew Yuan, Jinsong Mao
+
+
+ Text-to-Image (T2I) diffusion models have rapidly advanced, enabling the
+generation of high-quality images that align closely with textual descriptions.
+However, this progress has also raised concerns about their misuse for
+propaganda and other malicious activities. Recent studies reveal that attackers
+can embed biases into these models through simple fine-tuning, causing them to
+generate targeted imagery when triggered by specific phrases. This underscores
+the potential for T2I models to act as tools for disseminating propaganda,
+producing images aligned with an attacker's objective for end-users.
+ Building on this concept, we introduce FameBias, a T2I biasing attack that
+manipulates the embeddings of input prompts to generate images featuring
+specific public figures. Unlike prior methods, FameBias operates solely on the
+input embedding vectors without requiring additional model training. We
+evaluate FameBias comprehensively using Stable Diffusion V2, generating a large
+corpus of images based on various trigger nouns and target public figures. Our
+experiments demonstrate that FameBias achieves a high attack success rate while
+preserving the semantic context of the original prompts across multiple
+trigger-target pairs.
+
+
+
+
+
+
+
+ ☆ Quo Vadis, Anomaly Detection? LLMs and VLMs in the Spotlight
+
+
+ Video anomaly detection (VAD) has witnessed significant advancements through
+the integration of large language models (LLMs) and vision-language models
+(VLMs), addressing critical challenges such as interpretability, temporal
+reasoning, and generalization in dynamic, open-world scenarios. This paper
+presents an in-depth review of cutting-edge LLM-/VLM-based methods in 2024,
+focusing on four key aspects: (i) enhancing interpretability through semantic
+insights and textual explanations, making visual anomalies more understandable;
+(ii) capturing intricate temporal relationships to detect and localize dynamic
+anomalies across video frames; (iii) enabling few-shot and zero-shot detection
+to minimize reliance on large, annotated datasets; and (iv) addressing
+open-world and class-agnostic anomalies by using semantic understanding and
+motion features for spatiotemporal coherence. We highlight their potential to
+redefine the landscape of VAD. Additionally, we explore the synergy between
+visual and textual modalities offered by LLMs and VLMs, highlighting their
+combined strengths and proposing future directions to fully exploit the
+potential in enhancing video anomaly detection.
+
+
+
+ comment: Research report
+
+
+
+
+
+
+ ☆ Towards understanding how attention mechanism works in deep learning
+
+
+ Attention mechanism has been extensively integrated within mainstream neural
+network architectures, such as Transformers and graph attention networks. Yet,
+its underlying working principles remain somewhat elusive. What is its essence?
+Are there any connections between it and traditional machine learning
+algorithms? In this study, we inspect the process of computing similarity using
+classic metrics and vector space properties in manifold learning, clustering,
+and supervised learning. We identify the key characteristics of similarity
+computation and information propagation in these methods and demonstrate that
+the self-attention mechanism in deep learning adheres to the same principles
+but operates more flexibly and adaptively. We decompose the self-attention
+mechanism into a learnable pseudo-metric function and an information
+propagation process based on similarity computation. We prove that the
+self-attention mechanism converges to a drift-diffusion process through
+continuous modeling provided the pseudo-metric is a transformation of a metric
+and certain reasonable assumptions hold. This equation could be transformed
+into a heat equation under a new metric. In addition, we give a first-order
+analysis of attention mechanism with a general pseudo-metric function. This
+study aids in understanding the effects and principle of attention mechanism
+through physical intuition. Finally, we propose a modified attention mechanism
+called metric-attention by leveraging the concept of metric learning to
+facilitate the ability to learn desired metrics more effectively. Experimental
+results demonstrate that it outperforms self-attention regarding training
+efficiency, accuracy, and robustness.
+
+
+
+
+
+
+
+
+ Zihan Ye, Xinyuan Ru, Shiming Chen, Yaochu Jin, Kaizhu Huang, Xiaobo Jin
+
+
+ Feature Generative Adversarial Networks have emerged as powerful generative
+models in producing high-quality representations of unseen classes within the
+scope of Zero-shot Learning (ZSL). This paper delves into the pivotal influence
+of unseen class priors within the framework of transductive ZSL (TZSL) and
+illuminates the finding that even a marginal prior bias can result in
+substantial accuracy declines. Our extensive analysis uncovers that this
+inefficacy fundamentally stems from the utilization of an unconditional unseen
+discriminator - a core component in existing TZSL. We further establish that
+the detrimental effects of this component are inevitable unless the generator
+perfectly fits class-specific distributions. Building on these insights, we
+introduce our Improved Feature Generation Framework, termed I-VAEGAN, which
+incorporates two novel components: Pseudo-conditional Feature Adversarial (PFA)
+learning and Variational Embedding Regression (VER). PFA circumvents the need
+for prior estimation by explicitly injecting the predicted semantics as pseudo
+conditions for unseen classes premised by precise semantic regression.
+Meanwhile, VER utilizes reconstructive pre-training to learn class statistics,
+obtaining better semantic regression. Our I-VAEGAN achieves state-of-the-art
+TZSL accuracy across various benchmarks and priors. Our code will be released
+upon acceptance.
+
+
+
+
+
+
+
+ ☆ Towards Modality Generalization: A Benchmark and Prospective Analysis
+
+
+ Multi-modal learning has achieved remarkable success by integrating
+information from various modalities, achieving superior performance in tasks
+like recognition and retrieval compared to uni-modal approaches. However,
+real-world scenarios often present novel modalities that are unseen during
+training due to resource and privacy constraints, a challenge current methods
+struggle to address. This paper introduces Modality Generalization (MG), which
+focuses on enabling models to generalize to unseen modalities. We define two
+cases: weak MG, where both seen and unseen modalities can be mapped into a
+joint embedding space via existing perceptors, and strong MG, where no such
+mappings exist. To facilitate progress, we propose a comprehensive benchmark
+featuring multi-modal algorithms and adapt existing methods that focus on
+generalization. Extensive experiments highlight the complexity of MG, exposing
+the limitations of existing methods and identifying key directions for future
+research. Our work provides a foundation for advancing robust and adaptable
+multi-modal models, enabling them to handle unseen modalities in realistic
+scenarios.
+
+
+
+
+
+
+
+ ☆ UNet--: Memory-Efficient and Feature-Enhanced Network Architecture based
+ on U-Net with Reduced Skip-Connections ACCV2024
+
+
+ U-Net models with encoder, decoder, and skip-connections components have
+demonstrated effectiveness in a variety of vision tasks. The skip-connections
+transmit fine-grained information from the encoder to the decoder. It is
+necessary to maintain the feature maps used by the skip-connections in memory
+before the decoding stage. Therefore, they are not friendly to devices with
+limited resources. In this paper, we propose a universal method and architecture
+to reduce memory consumption while generating enhanced feature maps
+to improve network performance. To this end, we design a simple but effective
+Multi-Scale Information Aggregation Module (MSIAM) in the encoder and an
+Information Enhancement Module (IEM) in the decoder. The MSIAM aggregates
+multi-scale feature maps into single-scale with less memory. After that, the
+aggregated feature maps can be expanded and enhanced to multi-scale feature
+maps by the IEM. By applying the proposed method on NAFNet, a SOTA model in the
+field of image restoration, we design a memory-efficient and feature-enhanced
+network architecture, UNet--. The memory demand by the skip-connections in the
+UNet-- is reduced by 93.3%, while the performance is improved compared to
+NAFNet. Furthermore, we show that our proposed method can be generalized to
+multiple visual tasks, with consistent improvements in both memory consumption
+and network accuracy compared to the existing efficient architectures.
+
+
+
+ comment: 17 pages, 7 figures, accepted by ACCV2024
+
+
+
+
+
+
+ ☆ Sampling Bag of Views for Open-Vocabulary Object Detection
+
+
+ Existing open-vocabulary object detection (OVD) develops methods for testing
+unseen categories by aligning object region embeddings with corresponding VLM
+features. A recent study leverages the idea that VLMs implicitly learn
+compositional structures of semantic concepts within the image. Instead of
+using an individual region embedding, it utilizes a bag of region embeddings as
+a new representation to incorporate compositional structures into the OVD task.
+However, this approach often fails to capture the contextual concepts of each
+region, leading to noisy compositional structures. This results in only
+marginal performance improvements and reduced efficiency. To address this, we
+propose a novel concept-based alignment method that samples a more powerful and
+efficient compositional structure. Our approach groups contextually related
+"concepts" into a bag and adjusts the scale of concepts within the bag for
+more effective embedding alignment. Combined with Faster R-CNN, our method
+achieves improvements of 2.6 box AP50 and 0.5 mask AP over prior work on novel
+categories in the open-vocabulary COCO and LVIS benchmarks. Furthermore, our
+method reduces CLIP computation in FLOPs by 80.3% compared to previous
+research, significantly enhancing efficiency. Experimental results demonstrate
+that the proposed method outperforms previous state-of-the-art models on the
+OVD datasets.
+
+
+
+ comment: 19 pages
+
+
+
+
+
+
+ ☆ AdaCo: Overcoming Visual Foundation Model Noise in 3D Semantic
+ Segmentation via Adaptive Label Correction AAAI
+
+
+ Recently, Visual Foundation Models (VFMs) have shown a remarkable
+generalization performance in 3D perception tasks. However, their effectiveness
+in large-scale outdoor datasets remains constrained by the scarcity of accurate
+supervision signals, the extensive noise caused by variable outdoor conditions,
+and the abundance of unknown objects. In this work, we propose a novel
+label-free learning method, Adaptive Label Correction (AdaCo), for 3D semantic
+segmentation. AdaCo first introduces the Cross-modal Label Generation Module
+(CLGM), providing cross-modal supervision with the formidable interpretive
+capabilities of the VFMs. Subsequently, AdaCo incorporates the Adaptive Noise
+Corrector (ANC), updating and adjusting the noisy samples within this
+supervision iteratively during training. Moreover, we develop an Adaptive
+Robust Loss (ARL) function to modulate each sample's sensitivity to noisy
+supervision, preventing potential underfitting issues associated with robust
+loss. Our proposed AdaCo can effectively mitigate the performance limitations
+of label-free learning networks in 3D semantic segmentation tasks. Extensive
+experiments on two outdoor benchmark datasets highlight the superior
+performance of our method.
+
+
+ Multimodal fake news detection aims to automatically identify real or fake
+news, thereby mitigating the adverse effects caused by such misinformation.
+Although prevailing approaches have demonstrated their effectiveness,
+challenges persist in cross-modal feature fusion and refinement for
+classification. To address this, we present a residual-aware compensation
+network with multi-granularity constraints (RaCMC) for fake news detection,
+that aims to sufficiently interact and fuse cross-modal features while
+amplifying the differences between real and fake news. First, a multiscale
+residual-aware compensation module is designed to interact and fuse features at
+different scales, and ensure both the consistency and exclusivity of feature
+interaction, thus acquiring high-quality features. Second, a multi-granularity
+constraints module is implemented to limit the distribution of both the news
+overall and the image-text pairs within the news, thus amplifying the
+differences between real and fake news at the news and feature levels. Finally,
+a dominant feature fusion reasoning module is developed to comprehensively
+evaluate news authenticity from the perspectives of both consistency and
+inconsistency. Experiments on three public datasets, including Weibo17,
+Politifact and GossipCop, reveal the superiority of the proposed method.
+
+
+
+ comment: 9 pages, 4 figures
+
+
+
+
+
+
+ ☆ An Improved Fault Diagnosis Strategy for Induction Motors Using Weighted
+ Probability Ensemble Deep Learning
+
+
+ Early detection of faults in induction motors is crucial for ensuring
+uninterrupted operations in industrial settings. Among the various fault types
+encountered in induction motors, bearing, rotor, and stator faults are the most
+prevalent. This paper introduces a Weighted Probability Ensemble Deep Learning
+(WPEDL) methodology, tailored for effectively diagnosing induction motor faults
+using high-dimensional data extracted from vibration and current features. The
+Short-Time Fourier Transform (STFT) is employed to extract features from both
+vibration and current signals. The performance of the WPEDL fault diagnosis
+method is compared against conventional deep learning models, demonstrating the
+superior efficacy of the proposed system. The multi-class fault diagnosis
+system based on WPEDL achieves high accuracies across different fault types:
+99.05% for bearing faults (vibration signal), 99.10% and 99.50% for rotor faults
+(current and vibration signals), and 99.60% and 99.52% for stator faults (current
+and vibration signals), respectively. To evaluate the robustness of our multi-class
+classification decisions, tests have been conducted on a combined dataset of
+52,000 STFT images encompassing all three faults. Our proposed model
+outperforms other models, achieving an accuracy of 98.89%. The findings
+underscore the effectiveness and reliability of the WPEDL approach for
+early-stage fault diagnosis in IMs, offering promising insights for enhancing
+industrial operational efficiency and reliability.
+
+
+
+
+
+
+
+ ☆ Band Prompting Aided SAR and Multi-Spectral Data Fusion Framework for
+ Local Climate Zone Classification ICASSP 2025
+
+
+ Local climate zone (LCZ) classification is of great value for understanding
+the complex interactions between urban development and local climate. Recent
+studies have increasingly focused on the fusion of synthetic aperture radar
+(SAR) and multi-spectral data to improve LCZ classification performance.
+However, it remains challenging due to the distinct physical properties of
+these two types of data and the absence of effective fusion guidance. In this
+paper, a novel band prompting aided data fusion framework is proposed for LCZ
+classification, namely BP-LCZ, which utilizes textual prompts associated with
+band groups to guide the model in learning the physical attributes of different
+bands and semantics of various categories inherent in SAR and multi-spectral
+data to augment the fused feature, thus enhancing LCZ classification
+performance. Specifically, a band group prompting (BGP) strategy is introduced
+to align the visual representation effectively at the level of band groups,
+which also facilitates a more adequate extraction of semantic information of
+different bands with textual information. In addition, a multivariate
+supervised matrix (MSM) based training strategy is proposed to alleviate the
+problem of positive and negative sample confusion by completing the supervised
+information. The experimental results demonstrate the effectiveness and
+superiority of the proposed data fusion framework.
+
+
+
+
+
+
+
+
+ Jiaqi Wu, Shihao Zhang, Simin Chen, Lixu Wang, Zehua Wang, Wei Chen, Fangyuan He, Zijian Tian, F. Richard Yu, Victor C. M. Leung
+
+
+ Edge computing has emerged as a key paradigm for deploying deep
+learning-based object detection in time-sensitive scenarios. However, existing
+edge detection methods face challenges: 1) difficulty balancing detection
+precision with lightweight models, 2) limited adaptability of generalized
+deployment designs, and 3) insufficient real-world validation. To address these
+issues, we propose the Edge Detection Toolbox (ED-TOOLBOX), which utilizes
+generalizable plug-and-play components to adapt object detection models for
+edge environments. Specifically, we introduce a lightweight Reparameterized
+Dynamic Convolutional Network (Rep-DConvNet) featuring weighted multi-shape
+convolutional branches to enhance detection performance. Additionally, we
+design a Sparse Cross-Attention (SC-A) network with a
+localized-mapping-assisted self-attention mechanism, enabling a well-crafted
+joint module for adaptive feature transfer. For real-world applications, we
+incorporate an Efficient Head into the YOLO framework to accelerate edge model
+optimization. To demonstrate practical impact, we identify a gap in helmet
+detection -- overlooking band fastening, a critical safety factor -- and create
+the Helmet Band Detection Dataset (HBDD). Using ED-TOOLBOX-optimized models, we
+address this real-world task. Extensive experiments validate the effectiveness
+of ED-TOOLBOX, with edge detection models outperforming six state-of-the-art
+methods in visual surveillance simulations, achieving real-time and accurate
+performance. These results highlight ED-TOOLBOX as a superior solution for edge
+object detection.
+
+
+
+
+
+
+
+ ☆ Expand VSR Benchmark for VLLM to Expertize in Spatial Rules
+
+
+ Distinguishing spatial relations is a basic part of human cognition which
+requires fine-grained cross-instance perception. Although benchmarks like
+MME, MMBench and SEED have comprehensively evaluated various capabilities,
+including visual spatial reasoning (VSR), there is still a lack of evaluation
+and optimization datasets of sufficient quantity and quality for Vision Large
+Language Models (VLLMs) that specifically target visual positional
+reasoning. To handle this, we first diagnosed current VLLMs with the VSR
+dataset and proposed a unified test set. We found current VLLMs to exhibit a
+contradiction of over-sensitivity to language instructions and
+under-sensitivity to visual positional information. By expanding the original
+benchmark from two aspects of tuning data and model structure, we mitigated
+this phenomenon. To our knowledge, we expanded spatially positioned image data
+controllably using diffusion models for the first time and integrated original
+visual encoding (CLIP) with three other powerful visual encoders (SigLIP, SAM and
+DINO). After conducting combination experiments on scaling data and models, we
+obtained a VLLM VSR Expert (VSRE) that not only generalizes better to different
+instructions but also accurately distinguishes differences in visual positional
+information. VSRE achieved over a 27% increase in accuracy on the VSR test
+set. It becomes a performant VLLM on the position reasoning of both the VSR
+dataset and relevant subsets of other evaluation benchmarks. We open-sourced
+the expanded model with data and Appendix at
+https://github.com/peijin360/vsre and hope it will accelerate
+advancements in VLLM on VSR learning.
+
+
+
+
+
+
+
+ ☆ GIMS: Image Matching System Based on Adaptive Graph Construction and
+ Graph Neural Network
+
+
+
+
+
+
+
+
+ Xianfeng Song, Yi Zou, Zheng Shi, Zheng Liu
+
+
+ Feature-based image matching has extensive applications in computer vision.
+Keypoints detected in images can be naturally represented as graph structures,
+and Graph Neural Networks (GNNs) have been shown to outperform traditional deep
+learning techniques. Consequently, the paradigm of image matching via GNNs has
+gained significant prominence in recent academic research. In this paper, we
+first introduce an innovative adaptive graph construction method that utilizes
+a filtering mechanism based on distance and dynamic threshold similarity. This
+method dynamically adjusts the criteria for incorporating new vertices based on
+the characteristics of existing vertices, allowing for the construction of more
+precise and robust graph structures while avoiding redundancy. We further
+combine the vertex processing capabilities of GNNs with the global awareness
+capabilities of Transformers to enhance the model's representation of spatial
+and feature information within graph structures. This hybrid model provides a
+deeper understanding of the interrelationships between vertices and their
+contributions to the matching process. Additionally, we employ the Sinkhorn
+algorithm to iteratively solve for optimal matching results. Finally, we
+validate our system using extensive image datasets and conduct comprehensive
+comparative experiments. Experimental results demonstrate that our system
+achieves an average improvement of 3.8x-40.3x in overall matching performance.
+Additionally, the number of vertices and edges significantly impacts training
+efficiency and memory usage; therefore, we employ multi-GPU technology to
+accelerate the training process. Our code is available at
+https://github.com/songxf1024/GIMS.
+
+
+
+
+
+
+
+ ☆ Adapter Merging with Centroid Prototype Mapping for Scalable
+ Class-Incremental Learning
+
+
+ We propose Adapter Merging with Centroid Prototype Mapping (ACMap), an
+exemplar-free framework for class-incremental learning (CIL) that addresses
+both catastrophic forgetting and scalability. While existing methods trade off
+between inference time and accuracy, ACMap consolidates task-specific adapters
+into a single adapter, ensuring constant inference time across tasks without
+compromising accuracy. The framework employs adapter merging to build a shared
+subspace that aligns task representations and mitigates forgetting, while
+centroid prototype mapping maintains high accuracy through consistent
+adaptation in the shared subspace. To further improve scalability, an early
+stopping strategy limits adapter merging as tasks increase. Extensive
+experiments on five benchmark datasets demonstrate that ACMap matches
+state-of-the-art accuracy while maintaining inference time comparable to the
+fastest existing methods. The code is available at
+https://github.com/tf63/ACMap
+
+
+ Controversial contents largely inundate the Internet, infringing various
+cultural norms and child protection standards. Traditional Image Content
+Moderation (ICM) models fall short in producing precise moderation decisions
+for diverse standards, while recent multimodal large language models (MLLMs),
+when adopted to general rule-based ICM, often produce classification and
+explanation results that are inconsistent with human moderators. Aiming at
+flexible, explainable, and accurate ICM, we design a novel rule-based dataset
+generation pipeline, decomposing concise human-defined rules and leveraging
+well-designed multi-stage prompts to enrich short explicit image annotations.
+Our ICM-Instruct dataset includes detailed moderation explanation and
+moderation Q-A pairs. Built upon it, we create our ICM-Assistant model in the
+framework of rule-based ICM, making it readily applicable in real practice. Our
+ICM-Assistant model demonstrates exceptional performance and flexibility.
+Specifically, it significantly outperforms existing approaches on various
+sources, improving both the moderation classification (36.8% on average) and
+moderation explanation quality (26.6% on average) consistently over existing
+MLLMs. Code/Data is available at https://github.com/zhaoyuzhi/ICM-Assistant.
+
+
+
+ comment: AAAI 2025
+
+
+
+
+
+
+ ☆ SDM-Car: A Dataset for Small and Dim Moving Vehicles Detection in
+ Satellite Videos
+
+
+
+
+
+
+
+
+ Zhen Zhang, Tao Peng, Liang Liao, Jing Xiao, Mi Wang
+
+
+ Vehicle detection and tracking in satellite video is essential in remote
+sensing (RS) applications. However, a statistical analysis of existing
+datasets shows that dim vehicles with low radiation intensity and
+limited contrast against the background are rarely annotated, which leads to
+the poor performance of existing approaches in detecting moving vehicles under low
+radiation conditions. In this paper, we address the challenge by building a
+Small and Dim Moving Cars (SDM-Car) dataset with a
+multitude of annotations for dim vehicles in satellite videos, which is
+collected by the Luojia 3-01 satellite and comprises 99 high-quality videos.
+Furthermore, we propose a method based on image enhancement and attention
+mechanisms to improve the detection accuracy of dim vehicles, serving as a
+benchmark for evaluating the dataset. Finally, we assess the performance of
+several representative methods on SDM-Car and present insightful findings. The
+dataset is openly available at https://github.com/TanedaM/SDM-Car.
+
+
+  In competitive combat sports like boxing, analyzing a boxer's performance
+statistics is crucial for evaluating the quantity and variety of punches delivered
+during bouts. These statistics provide valuable data and feedback, which are
+routinely used for coaching and performance enhancement. We introduce BoxMAC, a
+real-world boxing dataset featuring 15 professional boxers and encompassing 13
+distinct action labels. Comprising over 60,000 frames, our dataset has been
+meticulously annotated for multiple actions per frame with inputs from a boxing
+coach. Since two boxers can execute different punches within a single
+timestamp, this problem falls under the domain of multi-label action
+classification. We propose a novel architecture for jointly recognizing
+multiple actions in both individual images and videos. We investigate baselines
+using deep neural network architectures to address both tasks. We believe that
+BoxMAC will enable researchers and practitioners to develop and evaluate more
+efficient models for performance analysis. With its realistic and diverse
+nature, BoxMAC can serve as a valuable resource for the advancement of boxing
+as a sport.
+
+
+
+ comment: 10 pages, 8 figures
+
+
+
+
+
+
+ ☆ Leveraging Deep Learning with Multi-Head Attention for Accurate
+ Extraction of Medicine from Handwritten Prescriptions
+
+
+
+
+
+
+
+
+ Usman Ali, Sahil Ranmbail, Muhammad Nadeem, Hamid Ishfaq, Muhammad Umer Ramzan, Waqas Ali
+
+
+ Extracting medication names from handwritten doctor prescriptions is
+challenging due to the wide variability in handwriting styles and prescription
+formats. This paper presents a robust method for extracting medicine names
+using a combination of Mask R-CNN and Transformer-based Optical Character
+Recognition (TrOCR) with Multi-Head Attention and Positional Embeddings. A
+novel dataset, featuring diverse handwritten prescriptions from various regions
+of Pakistan, was utilized to fine-tune the model on different handwriting
+styles. The Mask R-CNN model segments the prescription images to focus on the
+medicinal sections, while the TrOCR model, enhanced by Multi-Head Attention and
+Positional Embeddings, transcribes the isolated text. The transcribed text is
+then matched against a pre-existing database for accurate identification. The
+proposed approach achieved a character error rate (CER) of 1.4% on standard
+benchmarks, highlighting its potential as a reliable and efficient tool for
+automating medicine name extraction.
+
+
+
+
+
+
+
+ ☆ VLABench: A Large-Scale Benchmark for Language-Conditioned Robotics
+ Manipulation with Long-Horizon Reasoning Tasks
+
+
+  General-purpose embodied agents are designed to understand users'
+natural instructions or intentions and act precisely to complete universal
+tasks. Recently, methods based on foundation models, especially
+Vision-Language-Action models (VLAs), have shown substantial potential to
+solve language-conditioned manipulation (LCM) tasks well. However, existing
+benchmarks do not adequately meet the needs of VLAs and related algorithms. To
+better define such general-purpose tasks in the context of LLMs and advance the
+research in VLAs, we present VLABench, an open-source benchmark for evaluating
+universal LCM task learning. VLABench provides 100 carefully designed
+categories of tasks, with strong randomization in each category of task and a
+total of 2000+ objects. VLABench stands out from previous benchmarks in four
+key aspects: 1) tasks requiring world knowledge and common sense transfer, 2)
+natural language instructions with implicit human intentions rather than
+templates, 3) long-horizon tasks demanding multi-step reasoning, and 4)
+evaluation of both action policies and language model capabilities. The
+benchmark assesses multiple competencies including understanding of
+mesh & texture, spatial relationship, semantic instruction, physical laws,
+knowledge transfer and reasoning, etc. To support the downstream finetuning, we
+provide high-quality training data collected via an automated framework
+incorporating heuristic skills and prior information. The experimental results
+indicate that both the current state-of-the-art pretrained VLAs and the
+workflow based on VLMs face challenges in our tasks.
+
+
+
+
+
+
+
+
+ Yucong Luo, Mingyue Cheng, Jie Ouyang, Xiaoyu Tao, Qi Liu
+
+
+ Text-to-image generative models excel in creating images from text but
+struggle with ensuring alignment and consistency between outputs and prompts.
+This paper introduces TextMatch, a novel framework that leverages multimodal
+optimization to address image-text discrepancies in text-to-image (T2I)
+generation and editing. TextMatch employs a scoring strategy powered by large
+language models (LLMs) and visual question-answering (VQA) models to evaluate
+semantic consistency between prompts and generated images. By integrating
+multimodal in-context learning and chain of thought reasoning, our method
+dynamically refines prompts through iterative optimization. This process
+ensures that the generated images better capture user intent, resulting in
+higher fidelity and relevance. Extensive experiments demonstrate that TextMatch
+significantly improves text-image consistency across multiple benchmarks,
+establishing a reliable framework for advancing the capabilities of
+text-to-image generative models. Our code is available at
+https://anonymous.4open.science/r/TextMatch-F55C/.
+
+
+
+
+
+
+
+ ☆ VisionGRU: A Linear-Complexity RNN Model for Efficient Image Analysis
+
+
+
+
+
+
+
+
+ Shicheng Yin, Kaixuan Yin, Weixing Chen, Enbo Huang, Yang Liu
+
+
+ Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) are two
+dominant models for image analysis. While CNNs excel at extracting multi-scale
+features and ViTs effectively capture global dependencies, both suffer from
+high computational costs, particularly when processing high-resolution images.
+Recently, state-space models (SSMs) and recurrent neural networks (RNNs) have
+attracted attention due to their efficiency. However, their performance in
+image classification tasks remains limited. To address these challenges, this
+paper introduces VisionGRU, a novel RNN-based architecture designed for
+efficient image classification. VisionGRU leverages a simplified Gated
+Recurrent Unit (minGRU) to process large-scale image features with linear
+complexity. It divides images into smaller patches and progressively reduces
+the sequence length while increasing the channel depth, thus facilitating
+multi-scale feature extraction. A hierarchical 2DGRU module with bidirectional
+scanning captures both local and global contexts, improving long-range
+dependency modeling, particularly for tasks like semantic segmentation.
+Experimental results on the ImageNet and ADE20K datasets demonstrate that
+VisionGRU outperforms ViTs, significantly reducing memory usage and
+computational costs, especially for high-resolution images. These findings
+underscore the potential of RNN-based approaches for developing efficient and
+scalable computer vision solutions. Codes will be available at
+https://github.com/YangLiu9208/VisionGRU.
+
+
+
+ comment: Codes will be available at https://github.com/YangLiu9208/VisionGRU
+
+
+
+
+
+
+ ☆ Enhancing Online Continual Learning with Plug-and-Play State Space Model
+ and Class-Conditional Mixture of Discretization
+
+
+
+
+
+
+
+
+ Sihao Liu, Yibo Yang, Xiaojie Li, David A. Clifton, Bernard Ghanem
+
+
+ Online continual learning (OCL) seeks to learn new tasks from data streams
+that appear only once, while retaining knowledge of previously learned tasks.
+Most existing methods rely on replay, focusing on enhancing memory retention
+through regularization or distillation. However, they often overlook the
+adaptability of the model, limiting the ability to learn generalizable and
+discriminative features incrementally from online training data. To address
+this, we introduce a plug-and-play module, S6MOD, which can be integrated into
+most existing methods and directly improve adaptability. Specifically, S6MOD
+introduces an extra branch after the backbone, where a mixture of
+discretization selectively adjusts parameters in a selective state space model,
+enriching selective scan patterns such that the model can adaptively select the
+most sensitive discretization method for current dynamics. We further design a
+class-conditional routing algorithm for dynamic, uncertainty-based adjustment
+and implement a contrastive discretization loss to optimize it. Extensive
+experiments combining our module with various models demonstrate that S6MOD
+significantly enhances model adaptability, leading to substantial performance
+gains and achieving state-of-the-art results.
+
+
+
+
+
+
+
+ ☆ Parallel Neural Computing for Scene Understanding from LiDAR Perception
+ in Autonomous Racing
+
+
+ Autonomous driving in high-speed racing, as opposed to urban environments,
+presents significant challenges in scene understanding due to rapid changes in
+the track environment. Traditional sequential network approaches may struggle
+to meet the real-time knowledge and decision-making demands of an autonomous
+agent covering large displacements in a short time. This paper proposes a novel
+baseline architecture for developing sophisticated models capable of true
+hardware-enabled parallelism, achieving neural processing speeds that mirror
+the agent's high velocity. The proposed model (Parallel Perception Network
+(PPN)) consists of two independent neural networks, segmentation and
+reconstruction networks, running in parallel on separate accelerated hardware.
+The model takes raw 3D point cloud data from the LiDAR sensor as input and
+converts it into a 2D Bird's Eye View Map on both devices. Each network
+independently extracts its input features along space and time dimensions and
+produces its outputs in parallel. The proposed model is trained on a system
+with two NVIDIA T4 GPUs, using a combination of loss functions, including edge
+preservation, and demonstrates a 2x speedup in model inference time compared to
+a sequential configuration. Implementation is available at:
+https://github.com/suwesh/Parallel-Perception-Network. Learned parameters of
+the trained networks are provided at:
+https://huggingface.co/suwesh/ParallelPerceptionNetwork.
+
+
+
+ comment: IEEE/ISED 2024
+
+
+
+
+
+
+ ☆ Image Quality Assessment: Exploring Regional Heterogeneity via Response
+ of Adaptive Multiple Quality Factors in Dictionary Space
+
+
+ Given that the factors influencing image quality vary significantly with
+scene, content, and distortion type, particularly in the context of regional
+heterogeneity, we propose an adaptive multi-quality factor (AMqF) framework to
+represent image quality in a dictionary space, enabling the precise capture of
+quality features in non-uniformly distorted regions. By designing an adapter,
+the framework can flexibly decompose quality factors (such as brightness,
+structure, contrast, etc.) that best align with human visual perception and
+quantify them into discrete visual words. These visual words respond to the
+constructed dictionary basis vector, and by obtaining the corresponding
+coordinate vectors, we can measure visual similarity. Our method offers two key
+contributions. First, an adaptive mechanism that extracts and decomposes
+quality factors according to human visual perception principles enhances their
+representation ability through reconstruction constraints. Second, the
+construction of a comprehensive and discriminative dictionary space and basis
+vector allows quality factors to respond effectively to the dictionary basis
+vector and capture non-uniform distortion patterns in images, significantly
+improving the accuracy of visual similarity measurement. The experimental
+results demonstrate that the proposed method outperforms existing
+state-of-the-art approaches in handling various types of distorted images. The
+source code is available at https://anonymous.4open.science/r/AMqF-44B2.
+
+
+
+
+
+
+
+ ☆ Semantics Disentanglement and Composition for Versatile Codec toward
+ both Human-eye Perception and Machine Vision Task
+
+
+ While learned image compression methods have achieved impressive results in
+either human visual perception or machine vision tasks, they are often
+specialized only for one domain. This drawback limits their versatility and
+generalizability across scenarios and also requires retraining to adapt to new
+applications, a process that adds significant complexity and cost in real-world
+scenarios. In this study, we introduce an innovative semantics DISentanglement
+and COmposition VERsatile codec (DISCOVER) to simultaneously enhance human-eye
+perception and machine vision tasks. The approach derives a set of labels per
+task through multimodal large models, to which grounding models are then applied
+for precise localization, enabling a comprehensive understanding and
+disentanglement of image components at the encoder side. At the decoding stage,
+a comprehensive reconstruction of the image is achieved by leveraging these
+encoded components alongside priors from generative models, thereby optimizing
+performance for both human visual perception and machine-based analytical
+tasks. Extensive experimental evaluations substantiate the robustness and
+effectiveness of DISCOVER, demonstrating superior performance in fulfilling the
+dual objectives of human and machine vision requirements.
+
+
+
+
+
+
+
+ ☆ DepthLab: From Partial to Complete
+
+
+
+
+
+
+
+
+ Zhiheng Liu, Ka Leong Cheng, Qiuyu Wang, Shuzhe Wang, Hao Ouyang, Bin Tan, Kai Zhu, Yujun Shen, Qifeng Chen, Ping Luo
+
+
+ Missing values remain a common challenge for depth data across its wide range
+of applications, stemming from various causes like incomplete data acquisition
+and perspective alteration. This work bridges this gap with DepthLab, a
+foundation depth inpainting model powered by image diffusion priors. Our model
+features two notable strengths: (1) it demonstrates resilience to
+depth-deficient regions, providing reliable completion for both continuous
+areas and isolated points, and (2) it faithfully preserves scale consistency
+with the conditioned known depth when filling in missing values. Drawing on
+these advantages, our approach proves its worth in various downstream tasks,
+including 3D scene inpainting, text-to-3D scene generation, sparse-view
+reconstruction with DUST3R, and LiDAR depth completion, exceeding current
+solutions in both numerical performance and visual quality. Our project page
+with source code is available at https://johanan528.github.io/depthlab_web/.
+
+
+
+ comment: Project page and code: https://johanan528.github.io/depthlab_web/
+
+
+
+
+
+
+ ☆ EvalMuse-40K: A Reliable and Fine-Grained Benchmark with Comprehensive
+ Human Annotations for Text-to-Image Generation Model Evaluation
+
+
+
+
+
+
+
+
+ Shuhao Han, Haotian Fan, Jiachen Fu, Liang Li, Tao Li, Junhui Cui, Yunqiu Wang, Yang Tai, Jingwei Sun, Chunle Guo, Chongyi Li
+
+
+ Recently, Text-to-Image (T2I) generation models have achieved significant
+advancements. Correspondingly, many automated metrics have emerged to evaluate
+the image-text alignment capabilities of generative models. However, the
+performance comparison among these automated metrics is limited by existing
+small datasets. Additionally, these datasets lack the capacity to assess the
+performance of automated metrics at a fine-grained level. In this study, we
+contribute an EvalMuse-40K benchmark, gathering 40K image-text pairs with
+fine-grained human annotations for image-text alignment-related tasks. In the
+construction process, we employ various strategies such as balanced prompt
+sampling and data re-annotation to ensure the diversity and reliability of our
+benchmark. This allows us to comprehensively evaluate the effectiveness of
+image-text alignment metrics for T2I models. Meanwhile, we introduce two new
+methods to evaluate the image-text alignment capabilities of T2I models:
+FGA-BLIP2 which involves end-to-end fine-tuning of a vision-language model to
+produce fine-grained image-text alignment scores and PN-VQA which adopts a
+novel positive-negative VQA manner in VQA models for zero-shot fine-grained
+evaluation. Both methods achieve impressive performance in image-text alignment
+evaluations. We also use our methods to rank current AIGC models, in which the
+results can serve as a reference source for future study and promote the
+development of T2I generation. The data and code will be made publicly
+available.
+
+
+
+
+
+
+
+ ☆ Dense-Face: Personalized Face Generation Model via Dense Annotation
+ Prediction
+
+
+ The text-to-image (T2I) personalization diffusion model can generate images
+of the novel concept based on the user input text caption. However, existing
+T2I personalized methods either require test-time fine-tuning or fail to
+generate images that align well with the given text caption. In this work, we
+propose a new T2I personalization diffusion model, Dense-Face, which can
+generate face images whose identity is consistent with the given reference subject
+and align well with the text caption. Specifically, we introduce a
+pose-controllable adapter for the high-fidelity image generation while
+maintaining the text-based editing ability of the pre-trained stable diffusion
+(SD). Additionally, we use internal features of the SD UNet to predict dense
+face annotations, enabling the proposed method to gain domain knowledge in face
+generation. Empirically, our method achieves state-of-the-art or competitive
+generation performance in image-text alignment, identity preservation, and pose
+control.
+
+
+
+ comment: 15 figures, 5 tables
+
+
+
+
+
+
+ ☆ Accelerating Post-Tornado Disaster Assessment Using Advanced Deep
+ Learning Models
+
+
+ Post-disaster assessments of buildings and infrastructure are crucial for
+both immediate recovery efforts and long-term resilience planning. This
+research introduces an innovative approach to automating post-disaster
+assessments through advanced deep learning models. Our proposed system employs
+state-of-the-art computer vision techniques (YOLOv11 and ResNet50) to rapidly
+analyze images and videos from disaster sites, extracting critical information
+about building characteristics, including damage level of structural components
+and the extent of damage. Our experimental results show promising performance,
+with ResNet50 achieving 90.28% accuracy and an inference time of 1529ms per
+image on multiclass damage classification. This study contributes to the field
+of disaster management by offering a scalable, efficient, and objective tool
+for post-disaster analysis, potentially capable of transforming how communities
+and authorities respond to and learn from catastrophic events.
+
+
+
+ comment: 3 pages, 4 Figures, 1 Table
+
+
+
+
+
+
+ ☆ ERVD: An Efficient and Robust ViT-Based Distillation Framework for
+ Remote Sensing Image Retrieval
+
+
+
+
+
+
+
+
+ Le Dong, Qixuan Cao, Lei Pu, Fangfang Wu, Weisheng Dong, Xin Li, Guangming Shi
+
+
+ ERVD: An Efficient and Robust ViT-Based Distillation Framework for Remote
+Sensing Image Retrieval
+
+
+
+
+
+
+
+ ☆ UniPLV: Towards Label-Efficient Open-World 3D Scene Understanding by
+ Regional Visual Language Supervision
+
+
+ We present UniPLV, a powerful framework that unifies point clouds, images and
+text in a single learning paradigm for open-world 3D scene understanding.
+UniPLV employs the image modal as a bridge to co-embed 3D points with
+pre-aligned images and text in a shared feature space without requiring
+carefully crafted point cloud text pairs. To accomplish multi-modal alignment,
+we propose two key strategies: (i) logit and feature distillation modules
+between images and point clouds, and (ii) a vision-point matching module that
+explicitly corrects the misalignment caused by point-to-pixel
+projection. To further improve the performance of our unified framework, we
+adopt four task-specific losses and a two-stage training strategy. Extensive
+experiments show that our method outperforms the state-of-the-art methods by an
+average of 15.6% and 14.8% for semantic segmentation over Base-Annotated and
+Annotation-Free tasks, respectively. The code will be released later.
+
+
+
+
+
+
+
+ ☆ VisionLLM-based Multimodal Fusion Network for Glottic Carcinoma Early
+ Detection
+
+
+ The early detection of glottic carcinoma is critical for improving patient
+outcomes, as it enables timely intervention, preserves vocal function, and
+significantly reduces the risk of tumor progression and metastasis. However,
+the similarity in morphology between glottic carcinoma and vocal cord dysplasia
+results in suboptimal detection accuracy. To address this issue, we propose a
+vision large language model-based (VisionLLM-based) multimodal fusion network
+for glottic carcinoma detection, known as MMGC-Net. By integrating image and
+text modalities, multimodal models can capture complementary information,
+leading to more accurate and robust predictions. In this paper, we collect a
+private real glottic carcinoma dataset named SYSU1H from the First Affiliated
+Hospital of Sun Yat-sen University, with 5,799 image-text pairs. We leverage an
+image encoder and additional Q-Former to extract vision embeddings and the
+Large Language Model Meta AI (Llama3) to obtain text embeddings. These
+modalities are then integrated through a laryngeal feature fusion block,
+enabling a comprehensive integration of image and text features, thereby
+improving the glottic carcinoma identification performance. Extensive
+experiments on the SYSU1H dataset demonstrate that MMGC-Net can achieve
+state-of-the-art performance, which is superior to previous multimodal models.
+
+
+ Hyperspectral salient object detection (HSOD) aims to extract targets or
+regions with significantly different spectra from hyperspectral images. While
+existing deep learning-based methods can achieve good detection results, they
+generally necessitate pixel-level annotations, which are notably challenging to
+acquire for hyperspectral images. To address this issue, we introduce point
+supervision into HSOD, and incorporate Spectral Saliency, derived from
+conventional HSOD methods, as a pivotal spectral representation within the
+framework. This integration leads to the development of a novel
+Spectrum-oriented Point-supervised Saliency Detector (SPSD). Specifically, we
+propose a novel pipeline, specifically designed for HSIs, to generate
+pseudo-labels, effectively mitigating the performance decline associated with
+point supervision strategy. Additionally, Spectral Saliency is employed to
+counteract information loss during model supervision and saliency refinement,
+thereby maintaining the structural integrity and edge accuracy of the detected
+objects. Furthermore, we introduce a Spectrum-transformed Spatial Gate to focus
+more precisely on salient regions while reducing feature redundancy. We have
+carried out comprehensive experiments on both HSOD-BIT and HS-SOD datasets to
+validate the efficacy of our proposed method, using mean absolute error (MAE),
+E-measure, F-measure, Area Under Curve, and Cross Correlation as evaluation
+metrics. For instance, on the HSOD-BIT dataset, our SPSD achieves a MAE of
+0.031 and an F-measure of 0.878. Thorough ablation studies have substantiated
+the effectiveness of each individual module and provided insights into the
+model's working mechanism. Further evaluations on RGB-thermal salient object
+detection datasets highlight the versatility of our approach.
+
+
+
+ comment: Accepted by IEEE TIM. Code: https://github.com/laprf/SPSD
+
+
+
+
+
+
+ ☆ Unveiling Visual Perception in Language Models: An Attention Head
+ Analysis Approach
+
+
+ Recent advancements in Multimodal Large Language Models (MLLMs) have
+demonstrated remarkable progress in visual understanding. This impressive leap
+raises a compelling question: how can language models, initially trained solely
+on linguistic data, effectively interpret and process visual content? This
+paper aims to address this question with systematic investigation across 4
+model families and 4 model scales, uncovering a unique class of attention heads
+that focus specifically on visual content. Our analysis reveals a strong
+correlation between the behavior of these attention heads, the distribution of
+attention weights, and their concentration on visual tokens within the input.
+These findings enhance our understanding of how LLMs adapt to multimodal tasks,
+demonstrating their potential to bridge the gap between textual and visual
+understanding. This work paves the way for the development of AI systems
+capable of engaging with diverse modalities.
+
+
+
+
+
+
+
+ ☆ Beyond the Known: Enhancing Open Set Domain Adaptation with Unknown
+ Exploration
+
+
+
+
+
+
+
+
+ Lucas Fernando Alvarenga e Silva, Samuel Felipe dos Santos, Nicu Sebe, Jurandy Almeida
+
+
+ Convolutional neural networks (CNNs) can learn directly from raw data,
+resulting in exceptional performance across various research areas. However,
+factors present in non-controllable environments such as unlabeled datasets
+with varying levels of domain and category shift can reduce model accuracy. The
+Open Set Domain Adaptation (OSDA) is a challenging problem that arises when
+both of these issues occur together. Existing OSDA approaches in literature
+only align known classes or use supervised training to learn unknown classes as
+a single new category. In this work, we introduce a new approach to improve
+OSDA techniques by extracting a set of high-confidence unknown instances and
+using it as a hard constraint to tighten the classification boundaries.
+Specifically, we use a new loss constraint that is evaluated in three different
+ways: (1) using pristine negative instances directly; (2) using data
+augmentation techniques to create randomly transformed negatives; and (3) with
+generated synthetic negatives containing adversarial features. We analyze
+different strategies to improve the discriminator and the training of the
+Generative Adversarial Network (GAN) used to generate synthetic negatives. We
+conducted extensive experiments and analysis on OVANet using three widely-used
+public benchmarks, the Office-31, Office-Home, and VisDA datasets. We were able
+to achieve an H-score similar to other state-of-the-art methods, while increasing
+the accuracy on unknown categories.
+
+
+
+
+
+
+
+ ☆ Multi-Point Positional Insertion Tuning for Small Object Detection
+
+
+ Small object detection aims to localize and classify small objects within
+images. With recent advances in large-scale vision-language pretraining,
+finetuning pretrained object detection models has emerged as a promising
+approach. However, finetuning large models is computationally and memory
+expensive. To address this issue, this paper introduces multi-point positional
+insertion (MPI) tuning, a parameter-efficient finetuning (PEFT) method for
+small object detection. Specifically, MPI incorporates multiple positional
+embeddings into a frozen pretrained model, enabling the efficient detection of
+small objects by providing precise positional information to latent features.
+Through experiments, we demonstrated the effectiveness of the proposed method
+on the SODA-D dataset. MPI performed comparably to conventional PEFT methods,
+including CoOp and VPT, while significantly reducing the number of parameters
+that need to be tuned.
+
+
+ Previous research on retinal vessel segmentation is targeted at a specific
+image domain, mostly color fundus photography (CFP). In this paper we make a
+brave attempt to attack a more challenging task of broad-domain retinal vessel
+segmentation (BD-RVS), which is to develop a unified model applicable to varied
+domains including CFP, SLO, UWF, OCTA and FFA. To that end, we propose Dual
+Convolutional Prompting (DCP) that learns to extract domain-specific features
+by localized prompting along both position and channel dimensions. DCP is
+designed as a plug-in module that can effectively turn an R2AU-Net-based vessel
+segmentation network into a unified model without the need to modify its
+network structure. For evaluation we build a broad-domain set using five public
+domain-specific datasets including ROSSA, FIVES, IOSTAR, PRIME-FP20 and
+VAMPIRE. In order to benchmark BD-RVS on the broad-domain dataset, we
+re-purpose a number of existing methods originally developed in other contexts,
+producing eight baseline methods in total. Extensive experiments show that the
+proposed method compares favorably against the baselines for BD-RVS.
+
+
+
+ comment: Accepted by ICASSP 2025
+
+
+
+
+
+
+ ☆ COMO: Cross-Mamba Interaction and Offset-Guided Fusion for Multimodal
+ Object Detection
+
+
+ Single-modal object detection tasks often experience performance degradation
+when encountering diverse scenarios. In contrast, multimodal object detection
+tasks can offer more comprehensive information about object features by
+integrating data from various modalities. Current multimodal object detection
+methods generally use various fusion techniques, including conventional neural
+networks and transformer-based models, to implement feature fusion strategies
+and achieve complementary information. However, since multimodal images are
+captured by different sensors, there are often misalignments between them,
+making direct matching challenging. This misalignment hinders the ability to
+establish strong correlations for the same object across different modalities.
+In this paper, we propose a novel approach called the CrOss-Mamba interaction
+and Offset-guided fusion (COMO) framework for multimodal object detection
+tasks. The COMO framework employs the cross-mamba technique to formulate
+feature interaction equations, enabling multimodal serialized state
+computation. This results in interactive fusion outputs while reducing
+computational overhead and improving efficiency. Additionally, COMO leverages
+high-level features, which are less affected by misalignment, to facilitate
+interaction and transfer complementary information between modalities,
+addressing the positional offset challenges caused by variations in camera
+angles and capture times. Furthermore, COMO incorporates a global and local
+scanning mechanism in the cross-mamba module to capture features with local
+correlation, particularly in remote sensing images. To preserve low-level
+features, the offset-guided fusion mechanism ensures effective multiscale
+feature utilization, allowing the construction of a multiscale fusion data cube
+that enhances detection performance.
+
+
+
+
+
+
+
+ ☆ MMFactory: A Universal Solution Search Engine for Vision-Language Tasks
+
+
+ With advances in foundational and vision-language models, and effective
+fine-tuning techniques, a large number of both general and special-purpose
+models have been developed for a variety of visual tasks. Despite the
+flexibility and accessibility of these models, no single model is able to
+handle all tasks and/or applications that may be envisioned by potential users.
+Recent approaches, such as visual programming and multimodal LLMs with
+integrated tools aim to tackle complex visual tasks, by way of program
+synthesis. However, such approaches overlook user constraints (e.g.,
+performance / computational needs), produce test-time sample-specific solutions
+that are difficult to deploy, and, sometimes, require low-level instructions
+that may be beyond the abilities of a naive user. To address these limitations,
+we introduce MMFactory, a universal framework that includes model and metrics
+routing components, acting like a solution search engine across various
+available models. Based on a task description and a few sample input-output pairs
+and (optionally) resource and/or performance constraints, MMFactory can suggest
+a diverse pool of programmatic solutions by instantiating and combining
+visio-lingual tools from its model repository. In addition to synthesizing
+these solutions, MMFactory also proposes metrics and benchmarks performance /
+resource characteristics, allowing users to pick a solution that meets their
+unique design constraints. From the technical perspective, we also introduced a
+committee-based solution proposer that leverages multi-agent LLM conversation
+to generate executable, diverse, universal, and robust solutions for the user.
+Experimental results show that MMFactory outperforms existing methods by
+delivering state-of-the-art solutions tailored to user problem specifications.
+Project page is available at https://davidhalladay.github.io/mmfactory_demo.
+
+
+ In the domain of facial recognition security, multimodal Face Anti-Spoofing
+(FAS) is essential for countering presentation attacks. However, existing
+technologies encounter challenges due to modality biases and imbalances, as
+well as domain shifts. Our research introduces a Mixture of Experts (MoE) model
+to address these issues effectively. We identified three limitations in
+traditional MoE approaches to multimodal FAS: (1) Coarse-grained experts'
+inability to capture nuanced spoofing indicators; (2) Gated networks'
+susceptibility to input noise affecting decision-making; (3) MoE's sensitivity
+to prompt tokens leading to overfitting with conventional learning methods. To
+mitigate these, we propose the Bypass Isolated Gating MoE (BIG-MoE) framework,
+featuring: (1) Fine-grained experts for enhanced detection of subtle spoofing
+cues; (2) An isolation gating mechanism to counteract input noise; (3) A novel
+differential convolutional prompt bypass enriching the gating network with
+critical local features, thereby improving perceptual capabilities. Extensive
+experiments on four benchmark datasets demonstrate significant generalization
+performance improvement in the multimodal FAS task. The code is released at
+https://github.com/murInJ/BIG-MoE.
+
+
+
+ comment: Accepted by ICASSP 2025
+
+
+
+
+
+
+ ☆ An Ensemble Approach to Short-form Video Quality Assessment Using
+ Multimodal LLM ICASSP 2025
+
+
+ The rise of short-form videos, characterized by diverse content, editing
+styles, and artifacts, poses substantial challenges for learning-based blind
+video quality assessment (BVQA) models. Multimodal large language models
+(MLLMs), renowned for their superior generalization capabilities, present a
+promising solution. This paper focuses on effectively leveraging a pretrained
+MLLM for short-form video quality assessment, regarding the impacts of
+pre-processing and response variability, and insights on combining the MLLM
+with BVQA models. We first investigated how frame pre-processing and sampling
+techniques influence the MLLM's performance. Then, we introduced a lightweight
+learning-based ensemble method that adaptively integrates predictions from the
+MLLM and state-of-the-art BVQA models. Our results demonstrated superior
+generalization performance with the proposed ensemble approach. Furthermore,
+the analysis of content-aware ensemble weights highlighted that some video
+characteristics are not fully represented by existing BVQA models, revealing
+potential directions to improve BVQA models further.
+
+
+ We propose a method for detecting the electrode positions in lithium-ion
+batteries. The process begins by identifying the region of interest (ROI) in
+the battery's X-ray image through corner point detection. A convolutional
+neural network is then used to regress the pole positions within this ROI.
+Finally, the regressed positions are optimized and corrected using corner point
+priors, significantly mitigating the loss of localization accuracy caused by
+operations such as feature map down-sampling and padding during network
+training. Our findings show that combining traditional pixel gradient analysis
+with CNN-based heatmap regression for keypoint extraction enhances both
+accuracy and efficiency, resulting in significant performance improvements.
+
+
+
+
+
+
+
+ ♻ ☆ Adversarial Attack Against Images Classification based on Generative
+ Adversarial Networks
+
+
+ Adversarial attacks on image classification systems have always been an
+important problem in the field of machine learning, and generative adversarial
+networks (GANs), as popular models in the field of image generation, have been
+widely used in various novel scenarios due to their powerful generative
+capabilities. However, with the popularity of generative adversarial networks,
+the misuse of fake image technology has raised a series of security problems,
+such as malicious tampering with other people's photos and videos, and invasion
+of personal privacy. Inspired by the generative adversarial networks, this work
+proposes a novel adversarial attack method, aiming to gain insight into the
+weaknesses of the image classification system and improve its anti-attack
+ability. Specifically, the generative adversarial networks are used to generate
+adversarial samples with small perturbations but enough to affect the
+decision-making of the classifier, and the adversarial samples are generated
+through the adversarial learning of the training generator and the classifier.
+From extensive experiment analysis, we evaluate the effectiveness of the method
+on a classical image classification dataset, and the results show that our
+model successfully deceives a variety of advanced classifiers while maintaining
+the naturalness of adversarial samples.
+
+
+
+ comment: 7 pages, 6 figures
+
+
+
+
+
+
+ ♻ ☆ DAE-Fuse: An Adaptive Discriminative Autoencoder for Multi-Modality
+ Image Fusion
+
+
+ In extreme scenarios such as nighttime or low-visibility environments,
+achieving reliable perception is critical for applications like autonomous
+driving, robotics, and surveillance. Multi-modality image fusion, particularly
+integrating infrared imaging, offers a robust solution by combining
+complementary information from different modalities to enhance scene
+understanding and decision-making. However, current methods face significant
+limitations: GAN-based approaches often produce blurry images that lack
+fine-grained details, while AE-based methods may introduce bias toward specific
+modalities, leading to unnatural fusion results. To address these challenges,
+we propose DAE-Fuse, a novel two-phase discriminative autoencoder framework
+that generates sharp and natural fused images. Furthermore, we pioneer the
+extension of image fusion techniques from static images to the video domain
+while preserving temporal consistency across frames, thus advancing the
+perceptual capabilities required for autonomous navigation. Extensive
+experiments on public datasets demonstrate that DAE-Fuse achieves
+state-of-the-art performance on multiple benchmarks, with superior
+generalizability to tasks like medical image fusion.
+
+
+
+
+
+
+
+ ♻ ☆ TrackGo: A Flexible and Efficient Method for Controllable Video
+ Generation AAAI 2025
+
+
+ Recent years have seen substantial progress in diffusion-based controllable
+video generation. However, achieving precise control in complex scenarios,
+including fine-grained object parts, sophisticated motion trajectories, and
+coherent background movement, remains a challenge. In this paper, we introduce
+TrackGo, a novel approach that leverages free-form masks and arrows for
+conditional video generation. This method offers users a flexible and
+precise mechanism for manipulating video content. We also propose the
+TrackAdapter for control implementation, an efficient and lightweight adapter
+designed to be seamlessly integrated into the temporal self-attention layers of
+a pretrained video generation model. This design leverages our observation that
+the attention map of these layers can accurately activate regions corresponding
+to motion in videos. Our experimental results demonstrate that our new
+approach, enhanced by the TrackAdapter, achieves state-of-the-art performance
+on key metrics such as FVD, FID, and ObjMC scores.
+
+
+
+ comment: Accepted by AAAI 2025
+
+
+
+
+
+
+ ♻ ☆ Guided Real Image Dehazing using YCbCr Color Space
+
+
+
+
+
+
+
+
+ Wenxuan Fang, Junkai Fan, Yu Zheng, Jiangwei Weng, Ying Tai, Jun Li
+
+
+ Image dehazing, particularly with learning-based methods, has gained
+significant attention due to its importance in real-world applications.
+However, relying solely on the RGB color space often falls short, frequently
+leaving residual haze. This arises from two main issues: the difficulty in
+obtaining clear textural features from hazy RGB images and the complexity of
+acquiring real haze/clean image pairs outside controlled environments like
+smoke-filled scenes. To address these issues, we first propose a novel
+Structure Guided Dehazing Network (SGDN) that leverages the superior structural
+properties of YCbCr features over RGB. It comprises two key modules: Bi-Color
+Guidance Bridge (BGB) and Color Enhancement Module (CEM). BGB integrates a
+phase integration module and an interactive attention module, utilizing the
+rich texture features of the YCbCr space to guide the RGB space, thereby
+recovering clearer features in both frequency and spatial domains. To maintain
+tonal consistency, CEM further enhances the color perception of RGB features by
+aggregating YCbCr channel information. Furthermore, for effective supervised
+learning, we introduce a Real-World Well-Aligned Haze (RW$^2$AH) dataset, which
+includes a diverse range of scenes from various geographical regions and
+climate conditions. Experimental results demonstrate that our method surpasses
+existing state-of-the-art methods across multiple real-world smoke/haze
+datasets. Code and Dataset:
+https://github.com/fiwy0527/AAAI25_SGDN.
+
+
+
+
+
+
+
+ ♻ ☆ Optimal-state Dynamics Estimation for Physics-based Human Motion Capture
+ from Videos NeurIPS 2024
+
+
+
+
+
+
+
+
+ Cuong Le, Viktor Johansson, Manon Kok, Bastian Wandt
+
+
+ Human motion capture from monocular videos has made significant progress in
+recent years. However, modern approaches often produce temporal artifacts, e.g.,
+in the form of jittery motion, and struggle to achieve smooth and physically
+plausible motions. Explicitly integrating physics, in the form of internal forces
+and exterior torques, helps alleviate these artifacts. Current
+state-of-the-art approaches make use of an automatic PD controller to predict
+torques and reaction forces in order to re-simulate the input kinematics, i.e.
+the joint angles of a predefined skeleton. However, due to imperfect physical
+models, these methods often require simplifying assumptions and extensive
+preprocessing of the input kinematics to achieve good performance. To this end,
+we propose a novel method to selectively incorporate the physics models with
+the kinematics observations in an online setting, inspired by a neural
+Kalman-filtering approach. We develop a control loop as a meta-PD controller to
+predict internal joint torques and external reaction forces, followed by a
+physics-based motion simulation. A recurrent neural network is introduced to
+realize a Kalman filter that attentively balances the kinematics input and
+simulated motion, resulting in an optimal-state dynamics prediction. We show
+that this filtering step is crucial for providing online supervision that helps
+balance the shortcomings of the respective input motions, and is thus important
+for not only capturing accurate global motion trajectories but also producing
+physically plausible human poses. The proposed approach excels in the
+physics-based human pose estimation task and demonstrates the physical
+plausibility of the predictive dynamics, compared to state of the art. The code
+is available on https://github.com/cuongle1206/OSDCap
+
+
+ Remote sensing semantic segmentation (RSS) is an essential technology in
+earth observation missions. Due to concerns over geographic information
+security, data privacy, storage bottleneck and industry competition,
+high-quality annotated remote sensing images are often isolated and distributed
+across institutions. The issue of remote sensing data islands poses challenges
+for fully utilizing isolated datasets to train a global model. Federated
+learning (FL), a privacy-preserving distributed collaborative learning
+technology, offers a potential solution to leverage isolated remote sensing
+data. Typically, remote sensing images from different institutions exhibit
+significant geographic heterogeneity, characterized by coupled
+class-distribution heterogeneity and object-appearance heterogeneity. However,
+existing FL methods lack consideration of them, leading to a decline in the
+performance of the global model when FL is directly applied to RSS. We propose
+a novel Geographic heterogeneity-aware Federated learning (GeoFed) framework to
+bridge data islands in RSS. Our framework consists of three modules, including
+the Global Insight Enhancement (GIE) module, the Essential Feature Mining (EFM)
+module and the Local-Global Balance (LoGo) module. Through the GIE module,
+class distribution heterogeneity is alleviated by introducing a prior global
+class distribution vector. We design an EFM module to alleviate object
+appearance heterogeneity by constructing essential features. Furthermore, the
+LoGo module enables the model to possess both global generalization capability
+and local adaptation. Extensive experiments on three public datasets (i.e.,
+FedFBP, FedCASID, FedInria) demonstrate that our GeoFed framework consistently
+outperforms the current state-of-the-art methods.
+
+
+ We introduce Shape Tokens, a 3D representation that is continuous, compact,
+and easy to incorporate into machine learning models. Shape Tokens act as
+conditioning vectors that represent shape information in a 3D flow-matching
+model. The flow-matching model is trained to approximate probability density
+functions corresponding to delta functions concentrated on the surfaces of
+shapes in 3D. By attaching Shape Tokens to various machine learning models, we
+can generate new shapes, convert images to 3D, align 3D shapes with text and
+images, and render shapes directly at variable, user specified, resolution.
+Moreover, Shape Tokens enable a systematic analysis of geometric properties
+such as normal, density, and deformation field. Across all tasks and
+experiments, utilizing Shape Tokens demonstrates strong performance compared to
+existing baselines.
+
+
+
+
+
+
+
+
+ Yuchen He, Chuyun Shen, Xiangfeng Wang, Bo Jin
+
+
+ Federated continual learning (FCL) aims to learn from sequential data streams
+in the decentralized federated learning setting, while simultaneously
+mitigating the catastrophic forgetting issue in classical continual learning.
+Existing FCL methods usually employ typical rehearsal mechanisms, which could
+result in privacy violations or additional onerous storage and computational
+burdens. In this work, an efficient and non-IID robust federated continual
+learning framework, called Federated Prototype-Augmented Prompt Learning
+(FPPL), is proposed. The FPPL can collaboratively learn lightweight prompts
+augmented by prototypes without rehearsal. On the client side, a fusion
+function is employed to fully leverage the knowledge contained in task-specific
+prompts for alleviating catastrophic forgetting. Additionally, global
+prototypes aggregated from the server are used to obtain unified representation
+through contrastive learning, mitigating the impact of non-IID-derived data
+heterogeneity. On the server side, locally uploaded prototypes are utilized to
+perform debiasing on the classifier, further alleviating the performance
+degradation caused by both non-IID and catastrophic forgetting. Empirical
+evaluations demonstrate the effectiveness of FPPL, achieving notable
+performance with an efficient design while remaining robust to diverse non-IID
+degrees. Code is available at: https://github.com/ycheoo/FPPL.
+
+
+
+
+
+
+
+ ♻ ☆ OmniHD-Scenes: A Next-Generation Multimodal Dataset for Autonomous
+ Driving
+
+
+ The rapid advancement of deep learning has intensified the need for
+comprehensive data for use by autonomous driving algorithms. High-quality
+datasets are crucial for the development of effective data-driven autonomous
+driving solutions. Next-generation autonomous driving datasets must be
+multimodal, incorporating data from advanced sensors that feature extensive
+data coverage, detailed annotations, and diverse scene representation. To
+address this need, we present OmniHD-Scenes, a large-scale multimodal dataset
+that provides comprehensive omnidirectional high-definition data. The
+OmniHD-Scenes dataset combines data from 128-beam LiDAR, six cameras, and six
+4D imaging radar systems to achieve full environmental perception. The dataset
+comprises 1501 clips, each approximately 30-s long, totaling more than 450K
+synchronized frames and more than 5.85 million synchronized sensor data points.
+We also propose a novel 4D annotation pipeline. To date, we have annotated 200
+clips with more than 514K precise 3D bounding boxes. These clips also include
+semantic segmentation annotations for static scene elements. Additionally, we
+introduce a novel automated pipeline for generation of the dense occupancy
+ground truth, which effectively leverages information from non-key frames.
+Alongside the proposed dataset, we establish comprehensive evaluation metrics,
+baseline models, and benchmarks for 3D detection and semantic occupancy
+prediction. These benchmarks utilize surround-view cameras and 4D imaging radar
+to explore cost-effective sensor solutions for autonomous driving applications.
+Extensive experiments demonstrate the effectiveness of our low-cost sensor
+configuration and its robustness under adverse conditions. Data will be
+released at https://www.2077ai.com/OmniHD-Scenes.
+
+
+
+
+
+
+
+ ♻ ☆ CognitionCapturer: Decoding Visual Stimuli From Human EEG Signal With
+ Multimodal Information
+
+
+ Electroencephalogram (EEG) signals have attracted significant attention from
+researchers due to their non-invasive nature and high temporal sensitivity in
+decoding visual stimuli. However, most recent studies have focused solely on
+the relationship between EEG and image data pairs, neglecting the valuable
+"beyond-image-modality" information embedded in EEG signals. This results in
+the loss of critical multimodal information in EEG. To address this limitation,
+we propose CognitionCapturer, a unified framework that fully leverages
+multimodal data to represent EEG signals. Specifically, CognitionCapturer
+trains Modality Expert Encoders for each modality to extract cross-modal
+information from the EEG modality. Then, it introduces a diffusion prior to map
+the EEG embedding space to the CLIP embedding space. Using a pretrained
+generative model, the proposed framework can then reconstruct visual
+stimuli with high semantic and structural fidelity. Notably, the framework does
+not require any fine-tuning of the generative models and can be extended to
+incorporate more modalities. Through extensive experiments, we demonstrate that
+CognitionCapturer outperforms state-of-the-art methods both qualitatively and
+quantitatively. Code: https://github.com/XiaoZhangYES/CognitionCapturer.
+
+
+
+
+
+
+
+ ♻ ☆ From CNN to CNN + RNN: Adapting Visualization Techniques for Time-Series
+ Anomaly Detection
+
+
+ Deep neural networks are highly effective in solving complex problems but are
+often viewed as "black boxes," limiting their adoption in contexts where
+transparency and explainability are essential. This lack of visibility raises
+ethical and legal concerns, particularly in critical areas like security, where
+automated decisions can have significant consequences. The General Data
+Protection Regulation (GDPR) underscores the importance of justifying these
+decisions. In this work, we explore visualization techniques to improve the
+understanding of anomaly detection models based on convolutional recurrent
+neural networks (CNN + RNN) with a TimeDistributed layer. Our model combines
+VGG19 for convolutional feature extraction and a GRU layer for sequential
+analysis of real-time video data. While suitable for temporal data, this
+structure complicates gradient propagation, as sequence elements are processed
+independently, dissociating temporal information. We adapt visualization
+techniques such as saliency maps and Grad-CAM to address these challenges. This
+article highlights the difficulties in visually interpreting video-based models
+and demonstrates how techniques for static images can be adapted to recurrent
+architectures, offering a transitional solution in the absence of dedicated
+methods.
+
+
+
+
+
+
+
+ ♻ ☆ LiFT: Leveraging Human Feedback for Text-to-Video Model Alignment
+
+
+ Recent advancements in text-to-video (T2V) generative models have shown
+impressive capabilities. However, these models are still inadequate in aligning
+synthesized videos with human preferences (e.g., accurately reflecting text
+descriptions), which is particularly difficult to address, as human preferences
+are inherently subjective and challenging to formalize as objective functions.
+Therefore, this paper proposes LiFT, a novel fine-tuning method leveraging
+human feedback for T2V model alignment. Specifically, we first construct a
+Human Rating Annotation dataset, LiFT-HRA, consisting of approximately 10k
+human annotations, each including a score and its corresponding rationale.
+Based on this, we train a reward model, LiFT-Critic, to learn a reward function
+effectively, which serves as a proxy for human judgment, measuring the
+alignment between given videos and human expectations. Lastly, we leverage the
+learned reward function to align the T2V model by maximizing the
+reward-weighted likelihood. As a case study, we apply our pipeline to
+CogVideoX-2B, showing that the fine-tuned model outperforms the CogVideoX-5B
+across all 16 metrics, highlighting the potential of human feedback in
+improving the alignment and quality of synthesized videos.
+
+
+ Compact UAV systems, while advancing delivery and surveillance, pose
+significant security challenges due to their small size, which hinders
+detection by traditional methods. This paper presents a cost-effective,
+unsupervised UAV detection method using spatial-temporal sequence processing to
+fuse multiple LiDAR scans for accurate UAV tracking in real-world scenarios.
+Our approach segments point clouds into foreground and background, analyzes
+spatial-temporal data, and employs a scoring mechanism to enhance detection
+accuracy. Tested on a public dataset, our solution placed 4th in the CVPR 2024
+UG2+ Challenge, demonstrating its practical effectiveness. We plan to
+open-source all designs, code, and sample data for the research community at
+github.com/lianghanfang/UnLiDAR-UAV-Est.
+
+
+
+ comment: Paper Accepted for ICASSP 2025
+
+
+
+
+
+
+ ♻ ☆ Dora: Sampling and Benchmarking for 3D Shape Variational Auto-Encoders
+
+
+ Recent 3D content generation pipelines commonly employ Variational
+Autoencoders (VAEs) to encode shapes into compact latent representations for
+diffusion-based generation. However, the widely adopted uniform point sampling
+strategy in Shape VAE training often leads to a significant loss of geometric
+details, limiting the quality of shape reconstruction and downstream generation
+tasks. We present Dora-VAE, a novel approach that enhances VAE reconstruction
+through our proposed sharp edge sampling strategy and a dual cross-attention
+mechanism. By identifying and prioritizing regions with high geometric
+complexity during training, our method significantly improves the preservation
+of fine-grained shape features. This sampling strategy and the dual attention
+mechanism enable the VAE to focus on crucial geometric details that are
+typically missed by uniform sampling approaches. To systematically evaluate VAE
+reconstruction quality, we additionally propose Dora-bench, a benchmark that
+quantifies shape complexity through the density of sharp edges, introducing a
+new metric focused on reconstruction accuracy at these salient geometric
+features. Extensive experiments on the Dora-bench demonstrate that Dora-VAE
+achieves comparable reconstruction quality to the state-of-the-art dense
+XCube-VAE while requiring a latent space at least 8$\times$ smaller (1,280 vs.
+> 10,000 codes). We will release our code and benchmark dataset to facilitate
+future research in 3D shape modeling.
+
+
+
+
+
+
+
+
+ Trung Trinh, Markus Heinonen, Luigi Acerbi, Samuel Kaski
+
+
+ Deep neural networks (DNNs) excel on clean images but struggle with corrupted
+ones. Incorporating specific corruptions into the data augmentation pipeline
+can improve robustness to those corruptions but may harm performance on clean
+images and other types of distortion. In this paper, we introduce an
+alternative approach that improves the robustness of DNNs to a wide range of
+corruptions without compromising accuracy on clean images. We first demonstrate
+that input perturbations can be mimicked by multiplicative perturbations in the
+weight space. Leveraging this, we propose Data Augmentation via Multiplicative
+Perturbation (DAMP), a training method that optimizes DNNs under random
+multiplicative weight perturbations. We also examine the recently proposed
+Adaptive Sharpness-Aware Minimization (ASAM) and show that it optimizes DNNs
+under adversarial multiplicative weight perturbations. Experiments on image
+classification datasets (CIFAR-10/100, TinyImageNet and ImageNet) and neural
+network architectures (ResNet50, ViT-S/16, ViT-B/16) show that DAMP enhances
+model generalization performance in the presence of corruptions across
+different settings. Notably, DAMP is able to train a ViT-S/16 on ImageNet from
+scratch, reaching the top-1 error of 23.7% which is comparable to ResNet50
+without extensive data augmentations.
+
+
+
+ comment: Published at NeurIPS 2024 (spotlight). Code is available at
+ https://github.com/trungtrinh44/DAMP
+
+
+
+
+
+
+ ♻ ☆ InstaGraM: Instance-level Graph Modeling for Vectorized HD Map Learning
+
+
+ For scalable autonomous driving, a robust map-based localization system,
+independent of GPS, is fundamental. To achieve such map-based localization,
+online high-definition (HD) map construction plays a significant role in
+accurate estimation of the pose. Although recent advancements in online HD map
+construction have predominantly investigated vectorized representation due
+to its effectiveness, they suffer from computational cost and a fixed parametric
+model, which limit scalability. To alleviate these limitations, we propose a
+novel HD map learning framework that leverages graph modeling. This framework
+is designed to learn the construction of diverse geometric shapes, thereby
+enhancing the scalability of HD map construction. Our approach involves
+representing the map elements as an instance-level graph by decomposing them
+into vertices and edges to facilitate accurate and efficient end-to-end
+vectorized HD map learning. Furthermore, we introduce an association strategy
+using a Graph Neural Network to efficiently handle the complex geometry of
+various map elements, while maintaining scalability. Comprehensive experiments
+on a public open dataset show that our proposed network outperforms the
+state-of-the-art model by $1.6$ mAP. We further showcase the superior
+scalability of our approach compared to state-of-the-art methods, achieving a
+$4.8$ mAP improvement in the long-range configuration. Our code is available at
+https://github.com/juyebshin/InstaGraM.
+
+
+
+ comment: Code available at https://github.com/juyebshin/InstaGraM
+
+
+
+
+
+
+
+ Yun Liu, Chengwen Zhang, Ruofan Xing, Bingda Tang, Bowen Yang, Li Yi
+
+
+ Understanding how humans cooperatively rearrange household objects is
+critical for VR/AR and human-robot interaction. However, in-depth studies on
+modeling these behaviors are under-researched due to the lack of relevant
+datasets. We fill this gap by presenting CORE4D, a novel large-scale 4D
+human-object-human interaction dataset focusing on collaborative object
+rearrangement, which encompasses diverse compositions of various object
+geometries, collaboration modes, and 3D scenes. With 1K human-object-human
+motion sequences captured in the real world, we enrich CORE4D by contributing
+an iterative collaboration retargeting strategy to augment motions to a variety
+of novel objects. Leveraging this approach, CORE4D comprises a total of 11K
+collaboration sequences spanning 3K real and virtual object shapes. Benefiting
+from extensive motion patterns provided by CORE4D, we benchmark two tasks
+aiming at generating human-object interaction: human-object motion forecasting
+and interaction synthesis. Extensive experiments demonstrate the effectiveness
+of our collaboration retargeting strategy and indicate that CORE4D has posed
+new challenges to existing human-object interaction generation methodologies.
+
+
+
+
+
+
+
+ ♻ ☆ ErasableMask: A Robust and Erasable Privacy Protection Scheme against
+ Black-box Face Recognition Models
+
+
+
+
+
+
+
+
+ Sipeng Shen, Yunming Zhang, Dengpan Ye, Xiuwen Shi, Long Tang, Haoran Duan, Jiacheng Deng, Ziyi Liu
+
+
+ While face recognition (FR) models have brought remarkable convenience in
+face verification and identification, they also pose substantial privacy risks
+to the public. Existing facial privacy protection schemes usually adopt
+adversarial examples to disrupt face verification of FR models. However, these
+schemes often suffer from weak transferability against black-box FR models and
+permanently damage identifiable information, so they cannot fulfill the
+requirements of authorized operations such as forensics and authentication. To
+address these limitations, we propose ErasableMask, a robust and erasable
+privacy protection scheme against black-box FR models. Specifically, via
+rethinking the inherent relationship between surrogate FR models, ErasableMask
+introduces a novel meta-auxiliary attack, which boosts black-box
+transferability by learning more general features through a stable and balanced
+optimization strategy. It also offers a perturbation erasion mechanism that
+supports the erasion of semantic perturbations in protected faces without
+degrading image quality. To further improve performance, ErasableMask employs a
+curriculum learning strategy to mitigate optimization conflicts between
+adversarial attack and perturbation erasion. Extensive experiments on the
+CelebA-HQ and FFHQ datasets demonstrate that ErasableMask achieves
+state-of-the-art performance in transferability, with over 72% confidence
+on average in commercial FR systems. Moreover, ErasableMask also exhibits
+outstanding perturbation erasion performance, achieving over 90% erasion
+success rate.
+
+
+
+
+
+
+
+
+ Haowei Zhu, Fangyuan Zhang, Rui Qin, Tianxiang Pan, Junhai Yong, Bin Wang
+
+
+ As the scale of vision models continues to grow, Visual Prompt Tuning (VPT)
+has emerged as a parameter-efficient transfer learning technique, noted for its
+superior performance compared to full fine-tuning. However, indiscriminately
+applying prompts to every layer without considering their inherent
+correlations can cause significant disturbances, leading to suboptimal
+transferability. Additionally, VPT disrupts the original self-attention
+structure, affecting the aggregation of visual features, and lacks a mechanism
+for explicitly mining discriminative visual features, which are crucial for
+classification. To address these issues, we propose a Semantic Hierarchical
+Prompt (SHIP) fine-tuning strategy. We adaptively construct semantic
+hierarchies and use semantic-independent and semantic-shared prompts to learn
+hierarchical representations. We also integrate attribute prompts and a prompt
+matching loss to enhance feature discrimination and employ decoupled attention
+for robustness and reduced inference costs. SHIP significantly improves
+performance, achieving a 4.9% gain in accuracy over VPT with a ViT-B/16
+backbone on VTAB-1k tasks. Our code is available at
+https://github.com/haoweiz23/SHIP.
+
+
+
+
+
+
+
+
+ Tina Dorosti, Manuel Schultheiss, Felix Hofmann, Johannes Thalhammer, Luisa Kirchner, Theresa Urban, Franz Pfeiffer, Florian Schaff, Tobias Lasser, Daniela Pfeiffer
+
+
+ We aim to optimize the binary detection of Chronic Obstructive Pulmonary
+Disease (COPD) based on emphysema presence in the lung with convolutional
+neural networks (CNN) by exploring manually adjusted versus automated
+window-setting optimization (WSO) on computed tomography (CT) images. 7,194 CT
+images (3,597 with COPD; 3,597 healthy controls) from 78 subjects were selected
+retrospectively (10.2018-12.2021) and preprocessed. For each image, intensity
+values were manually clipped to the emphysema window setting and a baseline
+'full-range' window setting. Class-balanced train, validation, and test sets
+contained 3,392, 1,114, and 2,688 images. The network backbone was optimized by
+comparing various CNN architectures. Furthermore, automated WSO was implemented
+by adding a customized layer to the model. The image-level area under the
+Receiver Operating Characteristics curve (AUC) [lower, upper limit 95%
+confidence] was utilized to compare model variations. Repeated inference (n=7)
+on the test set showed that the DenseNet was the most efficient backbone and
+achieved a mean AUC of 0.80 [0.76, 0.85] without WSO. In comparison, with input
+images manually adjusted to the emphysema window, the DenseNet model predicted
+COPD with a mean AUC of 0.86 [0.82, 0.89]. By adding a customized WSO layer to
+the DenseNet, an optimal window in the proximity of the emphysema window
+setting was learned automatically, and a mean AUC of 0.82 [0.78, 0.86] was
+achieved. Detection of COPD with DenseNet models was improved by WSO of CT data
+to the emphysema window setting range.
+
+
+
+
+
+
+
+ ♻ ☆ Exploring Human-in-the-Loop Test-Time Adaptation by Synergizing Active
+ Learning and Model Selection
+
+
+
+
+
+
+
+
+ Yushu Li, Yongyi Su, Xulei Yang, Kui Jia, Xun Xu
+
+
+ Existing test-time adaptation (TTA) approaches often adapt models with the
+unlabeled testing data stream. A recent attempt relaxed the assumption by
+introducing limited human annotation, referred to as Human-In-the-Loop
+Test-Time Adaptation (HILTTA) in this study. The focus of existing HILTTA
+studies lies in selecting the most informative samples to label, a.k.a. active
+learning. In this work, we are motivated by a pitfall of TTA, i.e. sensitivity
+to hyper-parameters, and propose to approach HILTTA by synergizing active
+learning and model selection. Specifically, we first select samples for human
+annotation (active learning) and then use the labeled data to select optimal
+hyper-parameters (model selection). To prevent the model selection process from
+overfitting to local distributions, multiple regularization techniques are
+employed to complement the validation objective. A sample selection strategy is
+further tailored by considering the balance between active learning and model
+selection purposes. We demonstrate on 5 TTA datasets that the proposed HILTTA
+approach is compatible with off-the-shelf TTA methods and such combinations
+substantially outperform the state-of-the-art HILTTA methods. Importantly, our
+proposed method can always prevent choosing the worst hyper-parameters on all
+off-the-shelf TTA methods. The source code is available at
+https://github.com/Yushu-Li/HILTTA.
+
+
+
+ comment: Accepted at Transactions on Machine Learning Research (TMLR)
+
+
+
+
+
+
+ ♻ ☆ BS-LDM: Effective Bone Suppression in High-Resolution Chest X-Ray Images
+ with Conditional Latent Diffusion Models
+
+
+
+
+
+
+
+
+ Yifei Sun, Zhanghao Chen, Hao Zheng, Ruiquan Ge, Jin Liu, Wenwen Min, Ahmed Elazab, Xiang Wan, Changmiao Wang
+
+
+ The interference of overlapping bones and pulmonary structures can reduce the
+effectiveness of Chest X-ray (CXR) examinations. Bone suppression techniques
+have been developed to improve diagnostic accuracy. Dual-energy subtraction
+(DES) imaging, a common method for bone suppression, is costly and exposes
+patients to higher radiation levels. Deep learning-based image generation
+methods have been proposed as alternatives; however, they often fail to produce
+high-quality and high-resolution images, resulting in the loss of critical
+lesion information and texture details. To address these issues, in this paper,
+we introduce an end-to-end framework for bone suppression in high-resolution
+CXR images, termed BS-LDM. This framework employs a conditional latent
+diffusion model to generate high-resolution soft tissue images with fine detail
+and critical lung pathology by performing bone suppression in the latent space.
+We implement offset noise during the noise addition phase of the training
+process to better render low-frequency information in soft tissue images.
+Additionally, we introduce a dynamic clipping strategy during the sampling
+process to refine pixel intensity in the generated soft tissue images. We
+compiled a substantial and high-quality bone suppression dataset, SZCH-X-Rays,
+including high-resolution paired CXR and DES soft tissue images from 818
+patients, collected from our partner hospitals. Moreover, we pre-processed 241
+pairs of CXR and DES soft tissue images from the JSRT dataset, the largest
+publicly available dataset. Comprehensive experimental and clinical evaluations
+demonstrate that BS-LDM exhibits superior bone suppression capabilities,
+highlighting its significant clinical potential.
+
+
+
+ comment: 9 pages, 6 figures
+
+
+
+
+
+
+ ♻ ☆ Enhancing Space-time Video Super-resolution via Spatial-temporal Feature
+ Interaction
+
+
+ The target of space-time video super-resolution (STVSR) is to increase both
+the frame rate (also referred to as the temporal resolution) and the spatial
+resolution of a given video. Recent approaches solve STVSR using end-to-end
+deep neural networks. A popular solution is to first increase the frame rate of
+the video; then perform feature refinement among different frame features; and
+finally increase the spatial resolutions of these features. The temporal
+correlation among features of different frames is carefully exploited in this
+process. The spatial correlation among features of different (spatial)
+resolutions, despite being also very important, is however not emphasized. In
+this paper, we propose a spatial-temporal feature interaction network to
+enhance STVSR by exploiting both spatial and temporal correlations among
+features of different frames and spatial resolutions. Specifically, the
+spatial-temporal frame interpolation module is introduced to interpolate low-
+and high-resolution intermediate frame features simultaneously and
+interactively. The spatial-temporal local and global refinement modules are
+respectively deployed afterwards to exploit the spatial-temporal correlation
+among different features for their refinement. Finally, a novel motion
+consistency loss is employed to enhance the motion continuity among
+reconstructed frames. We conduct experiments on three standard benchmarks,
+Vid4, Vimeo-90K and Adobe240, and the results demonstrate that our method
+improves on state-of-the-art methods by a considerable margin. Our code will
+be available at
+https://github.com/yuezijie/STINet-Space-time-Video-Super-resolution.
+
+
+
+ comment: Neural Networks
+
+
+
+
+
+
+ ♻ ☆ A Critical Assessment of Visual Sound Source Localization Models
+ Including Negative Audio ICASSP 2025
+
+
+ The task of Visual Sound Source Localization (VSSL) involves identifying the
+location of sound sources in visual scenes, integrating audio-visual data for
+enhanced scene understanding. Despite advancements in state-of-the-art (SOTA)
+models, we observe three critical flaws: i) The evaluation of the models is
+mainly focused on sounds produced by objects that are visible in the image, ii)
+The evaluation often assumes prior knowledge of the size of the sounding
+object, and iii) No universal threshold for localization in real-world
+scenarios is established, as previous approaches only consider positive
+examples without accounting for both positive and negative cases. In this
+paper, we introduce a novel test set and metrics designed to complete the
+current standard evaluation of VSSL models by testing them in scenarios where
+none of the objects in the image corresponds to the audio input, i.e. a
+negative audio. We consider three types of negative audio: silence, noise and
+offscreen. Our analysis reveals that numerous SOTA models fail to appropriately
+adjust their predictions based on audio input, suggesting that these models may
+not be leveraging audio information as intended. Additionally, we provide a
+comprehensive analysis of the range of maximum values in the estimated
+audio-visual similarity maps, in both positive and negative audio cases, and
+show that most of the models are not discriminative enough, making them unfit
+to choose a universal threshold appropriate to perform sound localization
+without any a priori information of the sounding object, that is, object size
+and visibility.
+
+
+
+ comment: Accepted in ICASSP 2025
+
+
+
+
+
+
+ ♻ ☆ The Practice of Averaging Rate-Distortion Curves over Testsets to
+ Compare Learned Video Codecs Can Cause Misleading Conclusions
+
+
+
+
+
+
+
+
+ M. Akin Yilmaz, Onur Keleş, A. Murat Tekalp
+
+
+ This paper aims to demonstrate how the prevalent practice in the learned
+video compression community of averaging rate-distortion (RD) curves across a
+test video set can lead to misleading conclusions in evaluating codec
+performance. Through an analytical study of a simple case and experimental
+results with two recent learned video codecs, we show how averaged RD curves
+can mislead comparative evaluation of different codecs, particularly when
+videos in a dataset have varying characteristics and operating ranges. We
+illustrate how a single video with distinct RD characteristics from the rest of
+the test set can disproportionately influence the average RD curve, potentially
+overshadowing a codec's superior performance across most individual sequences.
+Using two recent learned video codecs on the UVG dataset as a case study, we
+demonstrate that computing performance metrics, such as the BD rate, from the
+average RD curve suggests conclusions that contradict those reached from
+calculating the average of per-sequence metrics. Hence, we argue that the
+learned video compression community should also report per-sequence RD curves
+and that performance metrics for a test set should be computed from the average of
+per-sequence metrics, similar to the established practice in traditional video
+coding, to ensure fair and accurate codec comparisons.
+
+
+
+ comment: Submitted to IEEE Signal Processing Letters
+
+
+
+
+
+
+ ♻ ☆ SpikeGS: Reconstruct 3D scene via fast-moving bio-inspired sensors AAAI2025
+
+
+ 3D Gaussian Splatting (3DGS) demonstrates unparalleled superior performance
+in 3D scene reconstruction. However, 3DGS relies heavily on sharp images.
+Fulfilling this requirement can be challenging in real-world scenarios,
+especially when the camera moves fast, which severely limits the application of
+3DGS. To address these challenges, we propose Spike Gaussian Splatting
+(SpikeGS), the first framework that integrates spike streams into the 3DGS
+pipeline to reconstruct 3D scenes via a fast-moving bio-inspired camera. With
+accumulation rasterization, interval supervision, and a specially designed
+pipeline, SpikeGS extracts detailed geometry and texture from
+high-temporal-resolution but texture-lacking spike streams and reconstructs 3D scenes captured in
+1 second. Extensive experiments on multiple synthetic and real-world datasets
+demonstrate the superiority of SpikeGS compared with existing spike-based and
+deblur 3D scene reconstruction methods. Codes and data will be released soon.
+
+
+ The uses of machine learning (ML) have snowballed in recent years. In many
+cases, ML models are highly complex, and their operation is beyond the
+understanding of human decision-makers. Nevertheless, some uses of ML models
+involve high-stakes and safety-critical applications. Explainable artificial
+intelligence (XAI) aims to help human decision-makers in understanding the
+operation of such complex ML models, thus eliciting trust in their operation.
+Unfortunately, the majority of past XAI work is based on informal approaches
+that offer no guarantees of rigor. Unsurprisingly, there exists comprehensive
+experimental and theoretical evidence confirming that informal methods of XAI
+can provide human decision-makers with erroneous information. Logic-based XAI
+represents a rigorous approach to explainability; it is model-based and offers
+the strongest guarantees of rigor of computed explanations. However, a
+well-known drawback of logic-based XAI is the complexity of logic reasoning,
+especially for highly complex ML models. Recent work proposed
+distance-restricted explanations, i.e. explanations that are rigorous provided
+the distance to a given input is small enough. Distance-restricted
+explainability is closely related to adversarial robustness, and it has been
+shown to scale for moderately complex ML models, but the number of inputs still
+represents a key limiting factor. This paper investigates novel algorithms for
+scaling up the performance of logic-based explainers when computing and
+enumerating ML model explanations with a large number of inputs.
+
+
+
+
+
+
+
+ ♻ ☆ Open-Vocabulary Mobile Manipulation Based on Double Relaxed Contrastive
+ Learning with Dense Labeling
+
+
+ Growing labor shortages are increasing the demand for domestic service robots
+(DSRs) to assist in various settings. In this study, we develop a DSR that
+transports everyday objects to specified pieces of furniture based on
+open-vocabulary instructions. Our approach focuses on retrieving images of
+target objects and receptacles from pre-collected images of indoor
+environments. For example, given an instruction "Please get the right red towel
+hanging on the metal towel rack and put it in the white washing machine on the
+left," the DSR is expected to carry the red towel to the washing machine based
+on the retrieved images. This is challenging because the correct images should
+be retrieved from thousands of collected images, which may include many images
+of similar towels and appliances. To address this, we propose RelaX-Former,
+which learns diverse and robust representations from among positive, unlabeled
+positive, and negative samples. We evaluated RelaX-Former on a dataset
+containing real-world indoor images and human-annotated instructions including
+complex referring expressions. The experimental results demonstrate that
+RelaX-Former outperformed existing baseline models across standard image
+retrieval metrics. Moreover, we performed physical experiments using a DSR to
+evaluate the performance of our approach in a zero-shot transfer setting. The
+experiments involved the DSR carrying objects to specific receptacles based on
+open-vocabulary instructions, achieving an overall success rate of 75%.
+
+
+
+ comment: Accepted for IEEE RA-L 2025
+
+
+
+
+
+
+ ♻ ☆ Singular Value Scaling: Efficient Generative Model Compression via
+ Pruned Weights Refinement AAAI 2025
+
+
+ While pruning methods effectively maintain model performance without extra
+training costs, they often focus solely on preserving crucial connections,
+overlooking the impact of pruned weights on subsequent fine-tuning or
+distillation, leading to inefficiencies. Moreover, most compression techniques
+for generative models have been developed primarily for GANs, tailored to
+specific architectures like StyleGAN, and research into compressing Diffusion
+models has just begun. Furthermore, these methods are often applicable only to
+GANs or Diffusion models, highlighting the need for approaches that work across
+both model types. In this paper, we introduce Singular Value Scaling (SVS), a
+versatile technique for refining pruned weights, applicable to both model
+types. Our analysis reveals that pruned weights often exhibit dominant singular
+vectors, hindering fine-tuning efficiency and leading to suboptimal performance
+compared to random initialization. Our method enhances weight initialization by
+minimizing the disparities between singular values of pruned weights, thereby
+improving the fine-tuning process. This approach not only guides the compressed
+model toward superior solutions but also significantly speeds up fine-tuning.
+Extensive experiments on StyleGAN2, StyleGAN3 and DDPM demonstrate that SVS
+improves compression performance across model types without additional training
+costs. Our code is available at:
+https://github.com/LAIT-CVLab/Singular-Value-Scaling.
+
+
+ Cooperatively utilizing both ego-vehicle and infrastructure sensor data via
+V2X communication has emerged as a promising approach for advanced autonomous
+driving. However, current research mainly focuses on improving individual
+modules, rather than taking end-to-end learning to optimize final planning
+performance, resulting in underutilized data potential. In this paper, we
+introduce UniV2X, a pioneering cooperative autonomous driving framework that
+seamlessly integrates all key driving modules across diverse views into a
+unified network. We propose a sparse-dense hybrid data transmission and fusion
+mechanism for effective vehicle-infrastructure cooperation, offering three
+advantages: 1) Effective for simultaneously enhancing agent perception, online
+mapping, and occupancy prediction, ultimately improving planning performance.
+2) Transmission-friendly for practical and limited communication conditions. 3)
+Reliable data fusion with interpretability of this hybrid data. We implement
+UniV2X, as well as reproduce several benchmark methods, on the challenging
+DAIR-V2X, the real-world cooperative driving dataset. Experimental results
+demonstrate the effectiveness of UniV2X in significantly enhancing planning
+performance, as well as all intermediate output performance. The project is
+available at
+https://github.com/AIR-THU/UniV2X.
+
+
+
+ comment: Accepted by AAAI 2025. Add more open-loop evaluation indicators
+
+
+
+
+
+
+ ♻ ☆ The Potential of Convolutional Neural Networks for Cancer Detection
+
+
+ Early detection of cancer is critical in improving treatment outcomes and
+increasing survival rates, particularly for common cancers such as lung,
+breast, and prostate, which collectively contribute to a significant global
+mortality burden. With advancements in imaging technologies and data
+processing, Convolutional Neural Networks (CNNs) have emerged as a powerful
+tool for analyzing and classifying medical images, enabling more precise cancer
+detection. This paper provides a comprehensive review of recent studies
+leveraging CNN models for detecting ten different types of cancer. Each study
+employs distinct CNN architectures to identify patterns associated with these
+cancers, utilizing diverse datasets. Key differences and strengths of these
+architectures are meticulously compared and analyzed, highlighting their
+efficacy in improving early detection. Beyond reviewing the performance and
+limitations of CNN-based cancer detection methods, this study explores the
+feasibility of integrating CNNs into clinical settings as an early detection
+tool, potentially complementing or replacing traditional methods. Despite
+significant progress, challenges remain, including data diversity, result
+interpretation, and ethical considerations. By identifying the best-performing
+CNN architectures and providing a comparative analysis, this study aims to
+contribute a comprehensive perspective on the application of CNNs in cancer
+detection and their role in advancing diagnostic capabilities in healthcare.
+
+
+
+
+
+
+
+
+ Nokyung Park, Daewon Chae, Jeongyong Shim, Sangpil Kim, Eun-Sol Kim, Jinkyu Kim
+
+
+ Learning domain-invariant visual representations is important to train a
+model that can generalize well to unseen target task domains. Recent works
+demonstrate that text descriptions contain high-level class-discriminative
+information and such auxiliary semantic cues can be used as effective pivot
+embedding for domain generalization problems. However, they use pivot embedding
+in a global manner (i.e., aligning an image embedding with sentence-level text
+embedding), which does not fully utilize the semantic cues of the given text
+description. In this work, we advocate for the use of local alignment between
+image regions and corresponding textual descriptions to get domain-invariant
+features. To this end, we first represent image and text inputs as graphs. We
+then cluster nodes within these graphs and match the graph-based image node
+features to the nodes of textual graphs. This matching process is conducted
+both globally and locally, tightly aligning visual and textual semantic
+sub-structures. We experiment with large-scale public datasets, such as CUB-DG
+and DomainBed, and our model matches or exceeds state-of-the-art
+performance on these datasets. The code is available at:
+https://github.com/noparkee/Graph-Clustering-based-DG
+
+
+
+
+
+
+
+ ♻ ☆ Exploring Parameter-Efficient Fine-Tuning to Enable Foundation Models in
+ Federated Learning
+
+
+
+
+
+
+
+
+ Guangyu Sun, Umar Khalid, Matias Mendieta, Pu Wang, Chen Chen
+
+
+ Federated learning (FL) has emerged as a promising paradigm for enabling the
+collaborative training of models without centralized access to the raw data on
+local devices. In the typical FL paradigm (e.g., FedAvg), model weights are
+sent to and from the server each round to participating clients. Recently, the
+use of small pre-trained models has been shown to be effective in federated
+learning optimization and improving convergence. However, recent
+state-of-the-art pre-trained models, known as "Foundation Models," are getting
+more capable but also have more parameters. In conventional FL, sharing the
+enormous model weights can quickly put a massive communication burden on the
+system, especially if more capable models are employed. Can we find a solution
+to enable those strong and readily available pre-trained models in FL to
+achieve excellent performance while simultaneously reducing the communication
+burden? To this end, we investigate the use of parameter-efficient fine-tuning
+in federated learning and thus introduce a new framework: FedPEFT.
+Specifically, we systematically evaluate the performance of FedPEFT across a
+variety of client stability, data distribution, and differential privacy
+settings. By only locally tuning and globally sharing a small portion of the
+model weights, significant reductions in the total communication overhead can
+be achieved while maintaining competitive or even better performance in a wide
+range of federated learning scenarios, providing insight into a new paradigm
+for practical and effective federated systems.
+
+
+
+ comment: Published in 2024 IEEE International Conference on Big Data
+
+ In this paper, we introduce the Diff-Instruct* (DI*), an image data-free
+approach for building one-step text-to-image generative models that align with
+human preference while maintaining the ability to generate highly realistic
+images. We frame human preference alignment as online reinforcement learning
+using human feedback (RLHF), where the goal is to maximize the reward function
+while regularizing the generator distribution to remain close to a reference
+diffusion process. Unlike traditional RLHF approaches, which rely on the KL
+divergence for regularization, we introduce a novel score-based divergence
+regularization, which leads to significantly better performance. Although the
+direct calculation of this preference alignment objective remains intractable,
+we demonstrate that we can efficiently compute its gradient by deriving an
+equivalent yet tractable loss function. Remarkably, we used Diff-Instruct* to
+train a Stable Diffusion-XL-based 1-step model, the 2.6B DI*-SDXL-1step
+text-to-image model, which can generate images of a resolution of 1024x1024
+with only 1 generation step. The DI*-SDXL-1step model uses only 1.88% inference
+time and 29.30% GPU memory cost to outperform 12B FLUX-dev-50step significantly
+in PickScore, ImageReward, and CLIPScore on Parti prompt benchmark and HPSv2.1
+on Human Preference Score benchmark, establishing a new state-of-the-art
+benchmark of human-preferred 1-step text-to-image generative models. Besides
+the strong quantitative performance, extensive qualitative comparisons also
+confirm the advantages of DI* in terms of maintaining diversity, improving
+image layouts, and enhancing aesthetic colors. We have released our
+industry-ready model on the homepage:
+https://github.com/pkulwj1994/diff_instruct_star.
+
+
+
+ comment: revision: 2.6B 1-step text-to-image model outperforms 12B
+ Flux-dev-50step model in human preferences
+
+ Edge detection has been one of the most difficult challenges in computer
+vision because of the difficulty of identifying borders and edges in
+real-world images that include objects of varying kinds and sizes. Methods based
+on ensemble learning, which use a combination of backbones and attention
+modules, have outperformed more conventional approaches, such as Sobel and Canny
+edge detection. Nevertheless, these algorithms still struggle when faced
+with complicated scene photos. In addition, the edges identified by
+current methods are not refined and often include incorrect edges. In this
+work, we use a Cascaded Ensemble Canny operator to address these problems and
+detect object edges. The challenging Fresh and Rotten and Berkeley
+datasets are used to test the proposed approach in Python. In terms of
+performance metrics and output image quality, the results outperform
+the specified edge detection networks.
+
+
+
+ comment: 2 Pages and 2 Figures
+
+
+
+
+
+
+ ♻ ☆ Adversarial Score identity Distillation: Rapidly Surpassing the Teacher
+ in One Step
+
+
+
+
+
+
+
+
+ Mingyuan Zhou, Huangjie Zheng, Yi Gu, Zhendong Wang, Hai Huang
+
+
+ Score identity Distillation (SiD) is a data-free method that has achieved
+SOTA performance in image generation by leveraging only a pretrained diffusion
+model, without requiring any training data. However, its ultimate performance
+is constrained by how accurately the pretrained model captures the true data
+scores at different stages of the diffusion process. In this paper, we
+introduce SiDA (SiD with Adversarial Loss), which not only enhances generation
+quality but also improves distillation efficiency by incorporating real images
+and adversarial loss. SiDA utilizes the encoder from the generator's score
+network as a discriminator, allowing it to distinguish between real images and
+those generated by SiD. The adversarial loss is batch-normalized within each
+GPU and then combined with the original SiD loss. This integration effectively
+incorporates the average "fakeness" per GPU batch into the pixel-based SiD
+loss, enabling SiDA to distill a single-step generator. SiDA converges
+significantly faster than its predecessor when distilled from scratch, and
+swiftly improves upon the original model's performance during fine-tuning from
+a pre-distilled SiD generator. This one-step adversarial distillation method
+establishes new benchmarks in generation performance when distilling EDM
+diffusion models, achieving FID scores of 1.110 on ImageNet 64x64. When
+distilling EDM2 models trained on ImageNet 512x512, our SiDA method surpasses
+even the largest teacher model, EDM2-XXL, which achieved an FID of 1.81 using
+classifier-free guidance (CFG) and 63 generation steps. In contrast, SiDA
+achieves FID scores of 2.156 for size XS, 1.669 for S, 1.488 for M, 1.413 for
+L, 1.379 for XL, and 1.366 for XXL, all without CFG and in a single generation
+step. These results highlight substantial improvements across all model sizes.
+Our code is available at https://github.com/mingyuanzhou/SiD/tree/sida.
+
+
+
+
+
+
+
+
+ Kaixin Xu, Zhe Wang, Chunyun Chen, Xue Geng, Jie Lin, Mohamed M. Sabry Aly, Xulei Yang, Min Wu, Xiaoli Li, Weisi Lin
+
+
+ Vision transformers have emerged as a promising alternative to convolutional
+neural networks for various image analysis tasks, offering comparable or
+superior performance. However, one significant drawback of vision transformers (ViTs) is their
+resource-intensive nature, leading to increased memory footprint, computation
+complexity, and power consumption. To democratize this high-performance
+technology and make it more environmentally friendly, it is essential to
+compress ViT models, reducing their resource requirements while maintaining
+high performance. In this paper, we introduce a new block-structured pruning to
+address the resource-intensive issue for ViTs, offering a balanced trade-off
+between accuracy and hardware acceleration. Unlike unstructured pruning or
+channel-wise structured pruning, block pruning leverages the block-wise
+structure of linear layers, resulting in more efficient matrix multiplications.
+To optimize this pruning scheme, our paper proposes a novel hardware-aware
+learning objective that simultaneously maximizes speedup and minimizes power
+consumption during inference, tailored to the block sparsity structure. This
+objective eliminates the need for empirical look-up tables and focuses solely
+on reducing parametrized layer connections. Moreover, our paper provides a
+lightweight algorithm to achieve post-training pruning for ViTs, utilizing
+second-order Taylor approximation and empirical optimization to solve the
+proposed hardware-aware objective. Extensive experiments on ImageNet are
+conducted across various ViT architectures, including DeiT-B and DeiT-S,
+demonstrating competitive performance with other pruning methods and achieving
+a remarkable balance between accuracy preservation and power savings.
+In particular, we achieve up to 3.93x and 1.79x speedups on dedicated hardware and
+GPUs respectively for DeiT-B, and also observe an inference power reduction of
+1.4x on real-world GPUs.
+
+
+
+
+
+
+
+ ♻ ☆ Mining and Transferring Feature-Geometry Coherence for Unsupervised
+ Point Cloud Registration NeurIPS2024
+
+
+ Point cloud registration, a fundamental task in 3D vision, has achieved
+remarkable success with learning-based methods in outdoor environments.
+Unsupervised outdoor point cloud registration methods have recently emerged to
+circumvent the need for costly pose annotations. However, they fail to
+establish reliable optimization objectives for unsupervised training, either
+relying on overly strong geometric assumptions, or suffering from poor-quality
+pseudo-labels due to inadequate integration of low-level geometric and
+high-level contextual information. We have observed that in the feature space,
+latent new inlier correspondences tend to cluster around respective positive
+anchors that summarize features of existing inliers. Motivated by this
+observation, we propose a novel unsupervised registration method termed INTEGER
+to incorporate high-level contextual information for reliable pseudo-label
+mining. Specifically, we propose the Feature-Geometry Coherence Mining module
+to dynamically adapt the teacher for each mini-batch of data during training
+and discover reliable pseudo-labels by considering both high-level feature
+representations and low-level geometric cues. Furthermore, we propose
+Anchor-Based Contrastive Learning to facilitate contrastive learning with
+anchors for a robust feature space. Lastly, we introduce a Mixed-Density
+Student to learn density-invariant features, addressing challenges related to
+density variation and low overlap in the outdoor scenario. Extensive
+experiments on KITTI and nuScenes datasets demonstrate that our INTEGER
+achieves competitive performance in terms of accuracy and generalizability.
+
+
+
+ comment: Accepted by NeurIPS2024
+
+
+
+
+
+
+ ♻ ☆ Towards Generalist Robot Policies: What Matters in Building
+ Vision-Language-Action Models
+
+
+ Foundation Vision Language Models (VLMs) exhibit strong capabilities in
+multi-modal representation learning, comprehension, and reasoning. By injecting
+action components into the VLMs, Vision-Language-Action Models (VLAs) can be
+naturally formed and also show promising performance. Existing work has
+demonstrated the effectiveness and generalization of VLAs in multiple scenarios
+and tasks. Nevertheless, the transfer from VLMs to VLAs is not trivial since
+existing VLAs differ in their backbones, action-prediction formulations, data
+distributions, and training recipes. This leads to a missing piece for a
+systematic understanding of the design choices of VLAs. In this work, we
+disclose the key factors that significantly influence the performance of VLAs
+and focus on answering three essential design questions: which backbone to
+select, how to formulate the VLA architectures, and when to add
+cross-embodiment data. The results firmly convince us of the need for VLAs,
+and we develop a new family of VLAs, RoboVLMs, which require very few
+manual designs and achieve new state-of-the-art performance in three
+simulation tasks and real-world experiments. Through our extensive experiments,
+which include over 8 VLM backbones, 4 policy architectures, and over 600
+distinctly designed experiments, we provide a detailed guidebook for the future
+design of VLAs. In addition to the study, the highly flexible RoboVLMs
+framework, which supports easy integrations of new VLMs and free combinations
+of various design choices, is made public to facilitate future research. We
+open-source all details, including code, models, datasets, and toolkits, along
+with detailed training and evaluation recipes at: robovlms.github.io.
+
+
+ Multimodal large language models (MLLMs) excel at generating highly detailed
+captions but often produce hallucinations. Our analysis reveals that existing
+hallucination detection methods struggle with detailed captions. We attribute
+this to the increasing reliance of MLLMs on their generated text, rather than
+the input image, as the sequence length grows. To address this issue, we
+propose a multiagent approach that leverages LLM-MLLM collaboration to correct
+given captions. Additionally, we introduce an evaluation framework and a
+benchmark dataset to facilitate the systematic analysis of detailed captions.
+Our experiments demonstrate that our proposed evaluation method better aligns
+with human judgments of factuality than existing metrics and that existing
+approaches to improve the MLLM factuality may fall short in hyper-detailed
+image captioning tasks. In contrast, our proposed method significantly enhances
+the factual accuracy of captions, even improving those generated by GPT-4V.
+Finally, we highlight a limitation of VQA-centric benchmarking by demonstrating
+that an MLLM's performance on VQA benchmarks may not correlate with its ability
+to generate detailed image captions.
+
+
+
+
+
+
+
+ ♻ ☆ CLIP4STR: A Simple Baseline for Scene Text Recognition with Pre-trained
+ Vision-Language Model
+
+
+
+
+
+
+
+
+ Shuai Zhao, Ruijie Quan, Linchao Zhu, Yi Yang
+
+
+  Pre-trained vision-language models (VLMs) are the de facto foundation models
+for various downstream tasks. However, scene text recognition methods still
+prefer backbones pre-trained on a single modality, namely, the visual modality,
+despite the potential of VLMs to serve as powerful scene text readers. For
+example, CLIP can robustly identify regular (horizontal) and irregular
+(rotated, curved, blurred, or occluded) text in images. With such merits, we
+transform CLIP into a scene text reader and introduce CLIP4STR, a simple yet
+effective STR method built upon image and text encoders of CLIP. It has two
+encoder-decoder branches: a visual branch and a cross-modal branch. The visual
+branch provides an initial prediction based on the visual feature, and the
+cross-modal branch refines this prediction by addressing the discrepancy
+between the visual feature and text semantics. To fully leverage the
+capabilities of both branches, we design a dual predict-and-refine decoding
+scheme for inference. We scale CLIP4STR in terms of the model size,
+pre-training data, and training data, achieving state-of-the-art performance on
+13 STR benchmarks. Additionally, a comprehensive empirical study is provided to
+enhance the understanding of the adaptation of CLIP to STR. Our method
+establishes a simple yet strong baseline for future STR research with VLMs.
+
+
+
+ comment: Accepted by T-IP. A PyTorch re-implementation is at
+ https://github.com/VamosC/CLIP4STR (Credit on GitHub@VamosC)
+
+ We propose an image-adaptive object detection method for adverse weather
+conditions such as fog and low-light. Our framework employs differentiable
+preprocessing filters to perform image enhancement suitable for later-stage
+object detections. Our framework introduces two differentiable filters: a
+Bézier curve-based pixel-wise (BPW) filter and a kernel-based local (KBL)
+filter. These filters unify the functions of classical image processing filters
+and improve performance of object detection. We also propose a domain-agnostic
+data augmentation strategy using the BPW filter. Our method does not require
+data-specific customization of the filter combinations, parameter ranges, and
+data augmentation. We evaluate our proposed approach, called Enhanced
+Robustness by Unified Image Processing (ERUP)-YOLO, by applying it to the
+YOLOv3 detector. Experiments on adverse weather datasets demonstrate that our
+proposed filters match or exceed the expressiveness of conventional methods and
+our ERUP-YOLO achieved superior performance in a wide range of adverse weather
+conditions, including fog and low-light conditions.
+
+
+
+ comment: Accepted to WACV 2025
+
+
+
+
+
+
+ ♻ ☆ Concept Complement Bottleneck Model for Interpretable Medical Image
+ Diagnosis
+
+
+ Models based on human-understandable concepts have received extensive
+attention to improve model interpretability for trustworthy artificial
+intelligence in the field of medical image analysis. These methods can provide
+convincing explanations for model decisions but heavily rely on the detailed
+annotation of pre-defined concepts. Consequently, they may not be effective in
+cases where concepts or annotations are incomplete or low-quality. Although
+some methods automatically discover effective and new visual concepts rather
+than using pre-defined concepts or could find some human-understandable
+concepts via large language models, they are prone to veering away from medical
+diagnostic evidence and are challenging to understand. In this paper, we
+propose a concept complement bottleneck model for interpretable medical image
+diagnosis with the aim of complementing the existing concept set and finding
+new concepts bridging the gap between explainable models. Specifically, we
+propose to use concept adapters for specific concepts to mine the concept
+differences and score concepts in their own attention channels to support
+nearly fair concept learning. Then, we devise a concept complement strategy
+to learn new concepts while jointly using known concepts to improve model
+performance. Comprehensive experiments on medical datasets demonstrate that our
+model outperforms the state-of-the-art competitors in concept detection and
+disease diagnosis tasks while providing diverse explanations to ensure model
+interpretability effectively.
+
+
+ Autoregressive (AR) models have achieved state-of-the-art performance in text
+and image generation but suffer from slow generation due to the token-by-token
+process. We ask an ambitious question: can a pre-trained AR model be adapted to
+generate outputs in just one or two steps? If successful, this would
+significantly advance the development and deployment of AR models. We notice
+that existing works that try to speed up AR generation by generating multiple
+tokens at once fundamentally cannot capture the output distribution due to the
+conditional dependencies between tokens, limiting their effectiveness for
+few-step generation. To address this, we propose Distilled Decoding (DD), which
+uses flow matching to create a deterministic mapping from Gaussian distribution
+to the output distribution of the pre-trained AR model. We then train a network
+to distill this mapping, enabling few-step generation. DD doesn't need the
+training data of the original AR model, making it more practical. We evaluate
+DD on state-of-the-art image AR models and present promising results on
+ImageNet-256. For VAR, which requires 10-step generation, DD enables one-step
+generation (6.3$\times$ speed-up), with an acceptable increase in FID from 4.19
+to 9.96. For LlamaGen, DD reduces generation from 256 steps to 1, achieving a
+217.8$\times$ speed-up with a comparable FID increase from 4.11 to 11.35. In
+both cases, baseline methods completely fail with FID>100. DD also excels on
+text-to-image generation, reducing the generation from 256 steps to 2 for
+LlamaGen with minimal FID increase from 25.70 to 28.95. As the first work to
+demonstrate the possibility of one-step generation for image AR models, DD
+challenges the prevailing notion that AR models are inherently slow, and opens
+up new opportunities for efficient AR generation. The project website is at
+https://imagination-research.github.io/distilled-decoding.
+
+
+ Open-vocabulary image segmentation has been advanced through the synergy
+between mask generators and vision-language models like Contrastive
+Language-Image Pre-training (CLIP). Previous approaches focus on generating
+masks while aligning mask features with text embeddings during training. In
+this paper, we observe that relying on generated low-quality masks can weaken
+the alignment of vision and language in regional representations. This
+motivates us to present a new fine-tuning framework, named MaskCLIP++, which
+uses ground-truth masks instead of generated masks to enhance the mask
+classification capability of CLIP. Due to the limited diversity of image
+segmentation datasets with mask annotations, we propose incorporating a
+consistency alignment constraint during fine-tuning, which alleviates
+categorical bias toward the fine-tuning dataset. After low-cost fine-tuning,
+combining with the mask generator in previous state-of-the-art mask-based open
+vocabulary segmentation methods, we achieve performance improvements of +1.7,
++2.3, +2.1, +3.1, and +0.3 mIoU on the A-847, PC-459, A-150, PC-59, and PAS-20
+datasets, respectively. Code is released at
+https://github.com/HVision-NKU/MaskCLIPpp .
+
+
+
+ comment: 20 pages, 8 figures. Add code link
+
+
+
+
+
+
+ ♻ ☆ A Multimodal Approach For Endoscopic VCE Image Classification Using
+ BiomedCLIP-PubMedBERT
+
+
+ This paper presents an advanced approach for fine-tuning BiomedCLIP
+PubMedBERT, a multimodal model, to classify abnormalities in Video Capsule
+Endoscopy (VCE) frames, aiming to enhance diagnostic efficiency in
+gastrointestinal healthcare. By integrating the PubMedBERT language model with
+a Vision Transformer (ViT) to process endoscopic images, our method categorizes
+images into ten specific classes: angioectasia, bleeding, erosion, erythema,
+foreign body, lymphangiectasia, polyp, ulcer, worms, and normal. Our workflow
+incorporates image preprocessing and fine-tunes the BiomedCLIP model to
+generate high-quality embeddings for both visual and textual inputs, aligning
+them through similarity scoring for classification. Performance metrics,
+including classification accuracy, recall, and F1 score, indicate the model's
+strong ability to accurately identify abnormalities in endoscopic frames,
+showing promise for practical use in clinical diagnostics.
+
+
+
+
+
+
+
+ ♻ ☆ ProCNS: Progressive Prototype Calibration and Noise Suppression for
+ Weakly-Supervised Medical Image Segmentation
+
+
+
+
+
+
+
+
+ Y. Liu, L. Lin, K. K. Y. Wong, X. Tang
+
+
+ Weakly-supervised segmentation (WSS) has emerged as a solution to mitigate
+the conflict between annotation cost and model performance by adopting sparse
+annotation formats (e.g., point, scribble, block, etc.). Typical approaches
+attempt to exploit anatomy and topology priors to directly expand sparse
+annotations into pseudo-labels. However, due to a lack of attention to the
+ambiguous edges in medical images and insufficient exploration of sparse
+supervision, existing approaches tend to generate erroneous and overconfident
+pseudo proposals in noisy regions, leading to cumulative model error and
+performance degradation. In this work, we propose a novel WSS approach, named
+ProCNS, encompassing two synergistic modules devised with the principles of
+progressive prototype calibration and noise suppression. Specifically, we
+design a Prototype-based Regional Spatial Affinity (PRSA) loss to maximize the
+pair-wise affinities between spatial and semantic elements, providing our model
+of interest with more reliable guidance. The affinities are derived from the
+input images and the prototype-refined predictions. Meanwhile, we propose an
+Adaptive Noise Perception and Masking (ANPM) module to obtain more enriched and
+representative prototype representations, which adaptively identifies and masks
+noisy regions within the pseudo proposals, reducing potential erroneous
+interference during prototype computation. Furthermore, we generate specialized
+soft pseudo-labels for the noisy regions identified by ANPM, providing
+supplementary supervision. Extensive experiments on six medical image
+segmentation tasks involving different modalities demonstrate that the proposed
+framework significantly outperforms representative state-of-the-art methods.
+
+
+ The study of Cloth-Changing Person Re-identification (CC-ReID) focuses on
+retrieving specific pedestrians when their clothing has changed, typically
+under the assumption that the entire pedestrian images are visible. Pedestrian
+images in real-world scenarios, however, are often partially obscured by
+obstacles, presenting a significant challenge to existing CC-ReID systems. In
+this paper, we introduce a more challenging task termed Occluded Cloth-Changing
+Person Re-Identification (OC4-ReID), which simultaneously addresses two
+challenges of clothing changes and occlusion. Concretely, we construct two new
+datasets, Occ-LTCC and Occ-PRCC, based on original CC-ReID datasets to include
+random occlusions of key pedestrian components (e.g., head, torso). Moreover,
+a novel benchmark is proposed for OC4-ReID incorporating a Train-Test Micro
+Granularity Screening (T2MGS) module to mitigate the influence of occlusion and
+proposing a Part-Robust Triplet (PRT) loss for partial features learning.
+Comprehensive experiments on the proposed datasets, as well as on two CC-ReID
+benchmark datasets, demonstrate the superior performance of the proposed method
+against other state-of-the-art methods. The codes and datasets are available
+at: https://github.com/1024AILab/OC4-ReID.
+
+
+
+
+
+
+
+ ♻ ☆ Croc: Pretraining Large Multimodal Models with Cross-Modal Comprehension
+
+
+ Recent advances in Large Language Models (LLMs) have catalyzed the
+development of Large Multimodal Models (LMMs). However, existing research
+primarily focuses on tuning language and image instructions, ignoring the
+critical pretraining phase where models learn to process textual and visual
+modalities jointly. In this paper, we propose a new pretraining paradigm for
+LMMs to enhance the visual comprehension capabilities of LLMs by introducing a
+novel cross-modal comprehension stage. Specifically, we design a dynamically
+learnable prompt token pool and employ the Hungarian algorithm to replace part
+of the original visual tokens with the most relevant prompt tokens. Then, we
+conceptualize visual tokens as analogous to a "foreign language" for the LLMs
+and propose a mixed attention mechanism with bidirectional visual attention and
+unidirectional textual attention to comprehensively enhance the understanding
+of visual tokens. Meanwhile, we integrate a detailed caption generation task,
+leveraging rich descriptions to further facilitate LLMs in understanding visual
+semantic information. After pretraining on 1.5 million publicly accessible
+data, we present a new foundation model called Croc. Experimental results
+demonstrate that Croc achieves new state-of-the-art performance on massive
+vision-language benchmarks. To support reproducibility and facilitate further
+research, we release the training code and pre-trained model weights at
+https://github.com/deepglint/Croc.
+
+
+
+ comment: 14 pages, 12 figures
+
+
+
+
+
+
+ ♻ ☆ An Evaluation Framework for Product Images Background Inpainting based
+ on Human Feedback and Product Consistency AAAI2025
+
+
+
+
+
+
+
+
+ Yuqi Liang, Jun Luo, Xiaoxi Guo, Jianqi Bi
+
+
+ In product advertising applications, the automated inpainting of backgrounds
+utilizing AI techniques in product images has emerged as a significant task.
+However, the techniques still suffer from issues such as inappropriate
+background and inconsistent product in generated product images, and existing
+approaches for evaluating the quality of generated product images are mostly
+inconsistent with human feedback causing the evaluation for this task to depend
+on manual annotation. To relieve the issues above, this paper proposes Human
+Feedback and Product Consistency (HFPC), which can automatically assess the
+generated product images based on two modules. Firstly, to solve inappropriate
+backgrounds, human feedback on 44,000 automated inpainting product images is
+collected to train a reward model based on multi-modal features extracted from
+BLIP and comparative learning. Secondly, to filter generated product images
+containing inconsistent products, a fine-tuned segmentation model is employed
+to segment the product of the original and generated product images and then
+compare the differences between the above two. Extensive experiments have
+demonstrated that HFPC can effectively evaluate the quality of generated
+product images and significantly reduce the expense of manual annotation.
+Moreover, HFPC achieves state-of-the-art performance (96.4% precision) compared to
+other open-source visual-quality-assessment models. Dataset and code are
+available at:
+https://github.com/created-Bi/background_inpainting_products_dataset
+
+
+ Semi-supervised semantic segmentation has attracted considerable attention
+for its ability to mitigate the reliance on extensive labeled data. However,
+existing consistency regularization methods only utilize high-certainty pixels
+with prediction confidence surpassing a fixed threshold for training, failing
+to fully leverage the potential supervisory information within the network.
+Therefore, this paper proposes the Uncertainty-participation Context
+Consistency Learning (UCCL) method to explore richer supervisory signals.
+Specifically, we first design the semantic backpropagation update (SBU)
+strategy to fully exploit the knowledge from uncertain pixel regions, enabling
+the model to learn consistent pixel-level semantic information from those
+areas. Furthermore, we propose the class-aware knowledge regulation (CKR)
+module to facilitate the regulation of class-level semantic features across
+different augmented views, promoting consistent learning of class-level
+semantic information within the encoder. Experimental results on two public
+benchmarks demonstrate that our proposed method achieves state-of-the-art
+performance. Our code is available at https://github.com/YUKEKEJAN/UCCL.
+
+
+
+ comment: To be published in ICASSP
+
+
+
+
+
+
+ ♻ ☆ Prediction Exposes Your Face: Black-box Model Inversion via Prediction
+ Alignment ECCV 2024
+
+
+ Model inversion (MI) attack reconstructs the private training data of a
+target model given its output, posing a significant threat to deep learning
+models and data privacy. On one hand, most existing MI methods focus on
+searching for latent codes to represent the target identity, yet this iterative
+optimization-based scheme consumes a huge number of queries to the target
+model, making it unrealistic especially in black-box scenario. On the other
+hand, some training-based methods launch an attack through a single forward
+inference, whereas failing to directly learn high-level mappings from
+prediction vectors to images. Addressing these limitations, we propose a novel
+Prediction-to-Image (P2I) method for black-box MI attack. Specifically, we
+introduce the Prediction Alignment Encoder to map the target model's output
+prediction into the latent code of StyleGAN. In this way, prediction vector
+space can be well aligned with the more disentangled latent space, thus
+establishing a connection between prediction vectors and the semantic facial
+features. During the attack phase, we further design the Aligned Ensemble
+Attack scheme to integrate complementary facial attributes of target identity
+for better reconstruction. Experimental results show that our method
+outperforms other SOTAs, e.g., compared with RLB-MI, our method improves attack
+accuracy by 8.5% and reduces query numbers by 99% on the CelebA dataset.
+
+
+ Although quantization for linear layers has been widely used, its application
+to accelerate the attention process remains limited. To further enhance the
+efficiency of attention computation compared to SageAttention while maintaining
+precision, we propose SageAttention2, which utilizes significantly faster 4-bit
+matrix multiplication (Matmul) alongside additional precision-enhancing
+techniques. First, we propose to quantize matrices $(Q, K)$ to INT4 in a
+hardware-friendly thread-level granularity and quantize matrices $(\widetilde
+P, V)$ to FP8. Second, we propose a method to smooth $Q$, enhancing the
+accuracy of INT4 $QK$. Third, we propose to use an FP32 Matmul buffer for $PV$
+to enhance the accuracy of FP8 $\widetilde PV$. The operations per second (OPS)
+of SageAttention2 surpass FlashAttention2 and xformers by about 3x and 5x on
+RTX4090, respectively. Comprehensive experiments confirm that our approach
+incurs negligible end-to-end metrics loss across diverse models, including
+those for large language processing, image generation, and video generation.
+The codes are available at https://github.com/thu-ml/SageAttention.
+
+
+
+
+
+
+
+ ♻ ☆ LangSurf: Language-Embedded Surface Gaussians for 3D Scene Understanding
+
+
+ Applying Gaussian Splatting to perception tasks for 3D scene understanding is
+becoming increasingly popular. Most existing works primarily focus on rendering
+2D feature maps from novel viewpoints, which leads to an imprecise 3D language
+field with outlier languages, ultimately failing to align objects in 3D space.
+By utilizing masked images for feature extraction, these approaches also lack
+essential contextual information, leading to inaccurate feature representation.
+To this end, we propose a Language-Embedded Surface Field (LangSurf), which
+accurately aligns the 3D language fields with the surface of objects,
+facilitating precise 2D and 3D segmentation with text query, widely expanding
+the downstream tasks such as removal and editing. The core of LangSurf is a
+joint training strategy that flattens the language Gaussian on the object
+surfaces using geometry supervision and contrastive losses to assign accurate
+language features to the Gaussians of objects. In addition, we also introduce
+the Hierarchical-Context Awareness Module to extract features at the image
+level for contextual information then perform hierarchical mask pooling using
+masks segmented by SAM to obtain fine-grained language features in different
+hierarchies. Extensive experiments on open-vocabulary 2D and 3D semantic
+segmentation demonstrate that LangSurf outperforms the previous
+state-of-the-art method LangSplat by a large margin. As shown in Fig. 1, our
+method is capable of segmenting objects in 3D space, thus boosting the
+effectiveness of our approach in instance recognition, removal, and editing,
+which is also supported by comprehensive experiments.
+https://langsurf.github.io.
+
+
+ Fisheye image rectification aims to correct distortions in images taken with
+fisheye cameras. Although current models show promising results on images with
+a similar degree of distortion as the training data, they will produce
+sub-optimal results when the degree of distortion changes and without
+retraining. The lack of generalization ability for dealing with varying degrees
+of distortion limits their practical application. In this paper, we take one
+step further to enable effective distortion rectification for images with
+varying degrees of distortion without retraining. We propose a novel
+Query-Based Controllable Distortion Rectification network for fisheye images
+(QueryCDR). In particular, we first present the Distortion-aware Learnable
+Query Mechanism (DLQM), which defines the latent spatial relationships for
+different distortion degrees as a series of learnable queries. Each query can
+be learned to obtain position-dependent rectification control conditions,
+providing control over the rectification process. Then, we propose two kinds of
+controllable modulating blocks to enable the control conditions to guide the
+modulation of the distortion features better. These core components cooperate
+with each other to effectively boost the generalization ability of the model at
+varying degrees of distortion. Extensive experiments on fisheye image datasets
+with different distortion degrees demonstrate our approach achieves
+high-quality and controllable distortion rectification.
+
+
+
+ comment: ECCV2024
+
+
+
+
+
+
+ ♻ ☆ A Pioneering Neural Network Method for Efficient and Robust Fuel
+ Sloshing Simulation in Aircraft AAAI
+
+
+ Simulating fuel sloshing within aircraft tanks during flight is crucial for
+aircraft safety research. Traditional methods based on Navier-Stokes equations
+are computationally expensive. In this paper, we treat fluid motion as point
+cloud transformation and propose the first neural network method specifically
+designed for simulating fuel sloshing in aircraft. This model is also the first
+deep learning model capable of stably modeling fluid particle dynamics in such
+complex scenarios. Our triangle feature fusion design
+achieves an optimal balance among fluid dynamics modeling, momentum
+conservation constraints, and global stability control. Additionally, we
+constructed the Fueltank dataset, the first dataset for aircraft fuel surface
+sloshing. It comprises 320,000 frames across four typical tank types and covers
+a wide range of flight maneuvers, including multi-directional rotations. We
+conducted comprehensive experiments on both our dataset and the take-off
+scenario of the aircraft. Compared to existing neural network-based fluid
+simulation algorithms, we significantly enhanced accuracy while maintaining
+high computational speed. Compared to traditional SPH methods, our speed
+improved approximately 10 times. Furthermore, compared to traditional fluid
+simulation software such as Flow3D, our computation speed increased by more
+than 300 times.
+
+
+
+ comment: This paper has been accepted by AAAI Conference on Artificial
+ Intelligence (AAAI-25)
+
+
+
+
+
+
+ ♻ ☆ Learning Mutual Excitation for Hand-to-Hand and Human-to-Human
+ Interaction Recognition
+
+
+
+
+
+
+
+
+ Mengyuan Liu, Chen Chen, Songtao Wu, Fanyang Meng, Hong Liu
+
+
+ Recognizing interactive actions, including hand-to-hand interaction and
+human-to-human interaction, has attracted increasing attention for various
+applications in the field of video analysis and human-robot interaction.
+Considering the success of graph convolution in modeling topology-aware
+features from skeleton data, recent methods commonly operate graph convolution
+on separate entities and use late fusion for interactive action recognition,
+which can barely model the mutual semantic relationships between pairwise
+entities. To this end, we propose a mutual excitation graph convolutional
+network (me-GCN) by stacking mutual excitation graph convolution (me-GC)
+layers. Specifically, me-GC uses a mutual topology excitation module to firstly
+extract adjacency matrices from individual entities and then adaptively model
+the mutual constraints between them. Moreover, me-GC extends the above idea and
+further uses a mutual feature excitation module to extract and merge deep
+features from pairwise entities. Compared with graph convolution, our proposed
+me-GC gradually learns mutual information in each layer and each stage of graph
+convolution operations. Extensive experiments on a challenging hand-to-hand
+interaction dataset, i.e., the Assembly101 dataset, and two large-scale
+human-to-human interaction datasets, i.e., NTU60-Interaction and
+NTU120-Interaction consistently verify the superiority of our proposed method,
+which outperforms the state-of-the-art GCN-based and Transformer-based methods.
+
+
+
+
+
+
+
+ ♻ ☆ A Phase Transition in Diffusion Models Reveals the Hierarchical Nature
+ of Data
+
+
+ Understanding the structure of real data is paramount in advancing modern
+deep-learning methodologies. Natural data such as images are believed to be
+composed of features organized in a hierarchical and combinatorial manner,
+which neural networks capture during learning. Recent advancements show that
+diffusion models can generate high-quality images, hinting at their ability to
+capture this underlying compositional structure. We study this phenomenon in a
+hierarchical generative model of data. We find that the backward diffusion
+process acting after a time $t$ is governed by a phase transition at some
+threshold time, where the probability of reconstructing high-level features,
+like the class of an image, suddenly drops. Instead, the reconstruction of
+low-level features, such as specific details of an image, evolves smoothly
+across the whole diffusion process. This result implies that at times beyond
+the transition, the class has changed, but the generated sample may still be
+composed of low-level elements of the initial image. We validate these
+theoretical insights through numerical experiments on class-unconditional
+ImageNet diffusion models. Our analysis characterizes the relationship between
+time and scale in diffusion models and puts forward generative models as
+powerful tools to model combinatorial data properties.
+
+
+ The performance of computer vision models in certain real-world applications
+(e.g., rare wildlife observation) is limited by the small number of available
+images. Expanding datasets using pre-trained generative models is an effective
+way to address this limitation. However, since the automatic generation process
+is uncontrollable, the generated images are usually limited in diversity, and
+some of them are undesired. In this paper, we propose a human-guided image
+generation method for more controllable dataset expansion. We develop a
+multi-modal projection method with theoretical guarantees to facilitate the
+exploration of both the original and generated images. Based on the
+exploration, users refine the prompts and re-generate images for better
+performance. Since directly refining the prompts is challenging for novice
+users, we develop a sample-level prompt refinement method to make it easier.
+With this method, users only need to provide sample-level feedback (e.g., which
+samples are undesired) to obtain better prompts. The effectiveness of our
+method is demonstrated through the quantitative evaluation of the multi-modal
+projection method, improved model performance in the case study for both
+classification and object detection tasks, and positive feedback from the
+experts.
+
+
+
+ comment: Accepted by TVCG2025
+
+
+
+
+
+
+ ♻ ☆ ODMixer: Fine-grained Spatial-temporal MLP for Metro Origin-Destination
+ Prediction
+
+
+
+
+
+
+
+
+ Yang Liu, Binglin Chen, Yongsen Zheng, Lechao Cheng, Guanbin Li, Liang Lin
+
+
+ Metro Origin-Destination (OD) prediction is a crucial yet challenging
+spatial-temporal prediction task in urban computing, which aims to accurately
+forecast cross-station ridership for optimizing metro scheduling and enhancing
+overall transport efficiency. Analyzing fine-grained and comprehensive
+relations among stations effectively is imperative for metro OD prediction.
+However, existing metro OD models either mix information from multiple OD pairs
+from the station's perspective or exclusively focus on a subset of OD pairs.
+These approaches may overlook fine-grained relations among OD pairs, leading to
+difficulties in predicting potential anomalous conditions. To address these
+challenges, we learn traffic evolution from the perspective of all OD pairs and
+propose a fine-grained spatial-temporal MLP architecture for metro OD
+prediction, namely ODMixer. Specifically, our ODMixer has double-branch
+structure and involves the Channel Mixer, the Multi-view Mixer, and the
+Bidirectional Trend Learner. The Channel Mixer aims to capture short-term
+temporal relations among OD pairs, the Multi-view Mixer concentrates on
+capturing spatial relations from both origin and destination perspectives. To
+model long-term temporal relations, we introduce the Bidirectional Trend
+Learner. Extensive experiments on two large-scale metro OD prediction datasets
+HZMOD and SHMO demonstrate the advantages of our ODMixer. Our code is available
+at https://github.com/KLatitude/ODMixer.
+
+
+
+ comment: Code is available at https://github.com/KLatitude/ODMixer
+
+ Fine-grained emotion recognition (FER) plays a vital role in various fields,
+such as disease diagnosis, personalized recommendations, and multimedia mining.
+However, existing FER methods face three key challenges in real-world
+applications: (i) they rely on large amounts of continuously annotated data to
+ensure accuracy since emotions are complex and ambiguous in reality, which is
+costly and time-consuming; (ii) they cannot capture the temporal heterogeneity
+caused by changing emotion patterns, because they usually assume that the
+temporal correlation within sampling periods is the same; (iii) they do not
+consider the spatial heterogeneity of different FER scenarios, that is, the
+distribution of emotion information in different data may have bias or
+interference. To address these challenges, we propose a Spatio-Temporal
+Fuzzy-oriented Multi-modal Meta-learning framework (ST-F2M). Specifically,
+ST-F2M first divides the multi-modal videos into multiple views, and each view
+corresponds to one modality of one emotion. Multiple randomly selected views
+for the same emotion form a meta-training task. Next, ST-F2M uses an integrated
+module with spatial and temporal convolutions to encode the data of each task,
+reflecting the spatial and temporal heterogeneity. Then it adds fuzzy semantic
+information to each task based on generalized fuzzy rules, which helps handle
+the complexity and ambiguity of emotions. Finally, ST-F2M learns
+emotion-related general meta-knowledge through meta-recurrent neural networks
+to achieve fast and robust fine-grained emotion recognition. Extensive
+experiments show that ST-F2M outperforms various state-of-the-art methods in
+terms of accuracy and model efficiency. In addition, we construct ablation
+studies and further analysis to explore why ST-F2M performs well.
+
+
+
+ comment: 13 pages, Submitted to TMM on 30-May-2024
+
+
+
+
+
+
+ ♻ ☆ Revisiting Lesion Tracking in 3D Total Body Photography
+
+
+ Melanoma is the most deadly form of skin cancer. Tracking the evolution of
+nevi and detecting new lesions across the body is essential for the early
+detection of melanoma. Despite prior work on longitudinal tracking of skin
+lesions in 3D total body photography, there are still several challenges,
+including 1) low accuracy for finding correct lesion pairs across scans, 2)
+sensitivity to noisy lesion detection, and 3) lack of large-scale datasets with
+numerous annotated lesion pairs. We propose a framework that takes in a pair of
+3D textured meshes, matches lesions in the context of total body photography,
+and identifies unmatchable lesions. We start by computing correspondence maps
+bringing the source and target meshes to a template mesh. Using these maps to
+define source/target signals over the template domain, we construct a flow
+field aligning the mapped signals. The initial correspondence maps are then
+refined by advecting forward/backward along the vector field. Finally, lesion
+assignment is performed using the refined correspondence maps. We propose the
+first large-scale dataset for skin lesion tracking with 25K lesion pairs across
+198 subjects. The proposed method achieves a success rate of 89.9% (at 10 mm
+criterion) for all pairs of annotated lesions and a matching accuracy of 98.2%
+for subjects with more than 200 lesions.
+
+
+
+ comment: v2
+
+
+
+
+
+
+ ♻ ☆ HaSPeR: An Image Repository for Hand Shadow Puppet Recognition
+
+
+ Hand shadow puppetry, also known as shadowgraphy or ombromanie, is a form of
+theatrical art and storytelling where hand shadows are projected onto flat
+surfaces to create illusions of living creatures. The skilled performers create
+these silhouettes by hand positioning, finger movements, and dexterous gestures
+to resemble shadows of animals and objects. Due to the lack of practitioners
+and a seismic shift in people's entertainment standards, this art form is on
+the verge of extinction. To facilitate its preservation and proliferate it to a
+wider audience, we introduce ${\rm H{\small A}SP{\small E}R}$, a novel dataset
+consisting of 15,000 images of hand shadow puppets across 15 classes extracted
+from both professional and amateur hand shadow puppeteer clips. We provide a
+detailed statistical analysis of the dataset and employ a range of pretrained
+image classification models to establish baselines. Our findings show a
+substantial performance superiority of skip-connected convolutional models over
+attention-based transformer architectures. We also find that lightweight
+models, such as MobileNetV2, suited for mobile applications and embedded
+devices, perform comparatively well. We surmise that such low-latency
+architectures can be useful in developing ombromanie teaching tools, and we
+create a prototype application to explore this surmise. Keeping the
+best-performing model ResNet34 under the limelight, we conduct comprehensive
+feature-spatial, explainability, and error analyses to gain insights into its
+decision-making process. To the best of our knowledge, this is the first
+documented dataset and research endeavor to preserve this dying art for future
+generations, with computer vision approaches. Our code and data will be
+publicly available.
+
+
+
+ comment: Submitted to IEEE Transactions on Artificial Intelligence (IEEE TAI),
+ 13 pages, 105 figures, 2 tables
+
+
+
+
+
+
+
+ Zhili Shen, Chenxin Diao, Pavlos Vougiouklis, Pascual Merita, Shriram Piramanayagam, Damien Graux, Dandan Tu, Zeren Jiang, Ruofei Lai, Yang Ren, Jeff Z. Pan
+
+
+ Retrieval-augmented generation systems rely on effective document retrieval
+capabilities. By design, conventional sparse or dense retrievers face
+challenges in multi-hop retrieval scenarios. In this paper, we present GeAR,
+which advances RAG performance through two key innovations: (i) graph
+expansion, which enhances any conventional base retriever, such as BM25, and
+(ii) an agent framework that incorporates graph expansion. Our evaluation
+demonstrates GeAR's superior retrieval performance on three multi-hop question
+answering datasets. Additionally, our system achieves state-of-the-art results
+with improvements exceeding 10% on the challenging MuSiQue dataset, while
+requiring fewer tokens and iterations compared to other multi-step retrieval
+systems.
+
+
+ Interactive Recommendation (IR) has gained significant attention recently for
+its capability to quickly capture dynamic interest and optimize both short and
+long term objectives. IR agents are typically implemented through Deep
+Reinforcement Learning (DRL), because DRL is inherently compatible with the
+dynamic nature of IR. However, DRL is currently not perfect for IR. Due to the
+large action space and sample inefficiency problem, training DRL recommender
+agents is challenging. The key point is that useful features cannot be
+extracted as high-quality representations for the recommender agent to optimize
+its policy. To tackle this problem, we propose Contrastive Representation for
+Interactive Recommendation (CRIR). CRIR efficiently extracts latent, high-level
+preference ranking features from explicit interaction, and leverages the
+features to enhance users' representation. Specifically, the CRIR provides
+representation through one representation network, and refines it through our
+proposed Preference Ranking Contrastive Learning (PRCL). The key insight of
+PRCL is that it can perform contrastive learning without relying on
+computations involving high-level representations or large potential action
+sets. Furthermore, we also propose a data exploiting mechanism and an agent
+training mechanism to better adapt CRIR to the DRL backbone. Extensive
+experiments have been carried out to show our method's superior improvement in
+sample efficiency when training a DRL-based IR agent.
+
+
+ Although prevailing supervised and self-supervised learning (SSL)-augmented
+sequential recommendation (SeRec) models have achieved improved performance
+with powerful neural network architectures, we argue that they still suffer
+from two limitations: (1) Preference Drift, where models trained on past data
+can hardly accommodate evolving user preference; and (2) Implicit Memory, where
+head patterns dominate parametric learning, making it harder to recall long
+tails. In this work, we explore retrieval augmentation in SeRec, to address
+these limitations. To this end, we propose a Retrieval-Augmented Sequential
+Recommendation framework, named RaSeRec, the main idea of which is to maintain
+a dynamic memory bank to accommodate preference drifts and retrieve relevant
+memories to augment user modeling explicitly. It consists of two stages: (i)
+collaborative-based pre-training, which learns to recommend and retrieve; (ii)
+retrieval-augmented fine-tuning, which learns to leverage retrieved memories.
+Extensive experiments on three datasets fully demonstrate the superiority and
+effectiveness of RaSeRec.
+
+
+ This study introduces Bidirectional Topic Matching (BTM), a novel method for
+cross-corpus topic modeling that quantifies thematic overlap and divergence
+between corpora. BTM is a flexible framework that can incorporate various topic
+modeling approaches, including BERTopic, Top2Vec, and Latent Dirichlet
+Allocation (LDA). BTM employs a dual-model approach, training separate topic
+models for each corpus and applying them reciprocally to enable comprehensive
+cross-corpus comparisons. This methodology facilitates the identification of
+shared themes and unique topics, providing nuanced insights into thematic
+relationships. Validation against cosine similarity-based methods demonstrates
+the robustness of BTM, with strong agreement metrics and distinct advantages in
+handling outlier topics. A case study on climate news articles showcases BTM's
+utility, revealing significant thematic overlaps and distinctions between
+corpora focused on climate change and climate action. BTM's flexibility and
+precision make it a valuable tool for diverse applications, from political
+discourse analysis to interdisciplinary studies. By integrating shared and
+unique topic analyses, BTM offers a comprehensive framework for exploring
+thematic relationships, with potential extensions to multilingual and dynamic
+datasets. This work highlights BTM's methodological contributions and its
+capacity to advance discourse analysis across various domains.
+
+
+
+ comment: 12 pages, 4 figures
+
+
+
+
+
+
+ ☆ An Automatic Graph Construction Framework based on Large Language Models
+ for Recommendation
+
+
+ Graph neural networks (GNNs) have emerged as state-of-the-art methods to
+learn from graph-structured data for recommendation. However, most existing
+GNN-based recommendation methods focus on the optimization of model structures
+and learning strategies based on pre-defined graphs, neglecting the importance
+of the graph construction stage. Earlier works for graph construction usually
+rely on specific rules or crowdsourcing, which are either too simplistic or
+too labor-intensive. Recent works start to utilize large language models (LLMs)
+to automate the graph construction, in view of their abundant open-world
+knowledge and remarkable reasoning capabilities. Nevertheless, they generally
+suffer from two limitations: (1) invisibility of global view (e.g., overlooking
+contextual information) and (2) construction inefficiency. To this end, we
+introduce AutoGraph, an automatic graph construction framework based on LLMs
+for recommendation. Specifically, we first use LLMs to infer the user
+preference and item knowledge, which is encoded as semantic vectors. Next, we
+employ vector quantization to extract the latent factors from the semantic
+vectors. The latent factors are then incorporated as extra nodes to link the
+user/item nodes, resulting in a graph with in-depth global-view semantics. We
+further design metapath-based message aggregation to effectively aggregate the
+semantic and collaborative information. The framework is model-agnostic and
+compatible with different backbone models. Extensive experiments on three
+real-world datasets demonstrate the efficacy and efficiency of AutoGraph
+compared to existing baseline methods. We have deployed AutoGraph on the Huawei
+advertising platform and gained a 2.69% improvement in RPM and a 7.31%
+improvement in eCPM in the online A/B test. AutoGraph is currently used
+as the main traffic model, serving hundreds of millions of people.
+
+
+
+ comment: Under review
+
+
+
+
+
+
+ ☆ Efficient Long Context Language Model Retrieval with Compression
+
+
+
+
+
+
+
+
+ Minju Seo, Jinheon Baek, Seongyun Lee, Sung Ju Hwang
+
+
+ Long Context Language Models (LCLMs) have emerged as a new paradigm to
+perform Information Retrieval (IR), which enables the direct ingestion and
+retrieval of information by processing an entire corpus in their single
+context, showcasing the potential to surpass traditional sparse and dense
+retrieval methods. However, processing a large number of passages
+in context for retrieval is computationally expensive, and handling their
+representations during inference further exacerbates the processing time; thus,
+we aim to make LCLM retrieval more efficient and potentially more effective
+with passage compression. Specifically, we propose a new compression approach
+tailored for LCLM retrieval, which is trained to maximize the retrieval
+performance while minimizing the length of the compressed passages. To
+accomplish this, we generate synthetic data, in which compressed passages are
+automatically created and labeled as chosen or rejected according to their
+retrieval success for a given query, and we train the proposed Compression
+model for Long context Retrieval (CoLoR) with this data via preference
+optimization while adding the length regularization loss on top of it to
+enforce brevity. Through extensive experiments on 9 datasets, we show that
+CoLoR improves the retrieval performance by 6% while compressing the in-context
+size by a factor of 1.91.
+
+
+ Sequential recommendation (SR) systems have evolved significantly over the
+past decade, transitioning from traditional collaborative filtering to deep
+learning approaches and, more recently, to large language models (LLMs). While
+the adoption of LLMs has driven substantial advancements, these models
+inherently lack collaborative filtering information, relying primarily on
+textual content data, neglecting other modalities, and thus failing to achieve
+optimal recommendation performance. To address this limitation, we propose
+Molar, a Multimodal large language sequential recommendation framework that
+integrates multiple content modalities with ID information to capture
+collaborative signals effectively. Molar employs an MLLM to generate unified
+item representations from both textual and non-textual data, facilitating
+comprehensive multimodal modeling and enriching item embeddings. Additionally,
+it incorporates collaborative filtering signals through a post-alignment
+mechanism, which aligns user representations from content-based and ID-based
+models, ensuring precise personalization and robust performance. By seamlessly
+combining multimodal content with collaborative filtering insights, Molar
+captures both user interests and contextual semantics, leading to superior
+recommendation accuracy. Extensive experiments validate that Molar
+significantly outperforms traditional and LLM-based baselines, highlighting its
+strength in utilizing multimodal data and collaborative signals for sequential
+recommendation tasks. The source code is available at
+https://anonymous.4open.science/r/Molar-8B06/.
+
+
+
+
+
+
+
+ ☆ Unlocking the Hidden Treasures: Enhancing Recommendations with Unlabeled
+ Data
+
+
+ Collaborative filtering (CF) stands as a cornerstone in recommender systems,
+yet effectively leveraging the massive unlabeled data presents a significant
+challenge. Current research focuses on addressing the challenge of unlabeled
+data by extracting a subset that closely approximates negative samples.
+Regrettably, the remaining data are overlooked, failing to fully integrate this
+valuable information into the construction of user preferences. To address this
+gap, we introduce a novel positive-neutral-negative (PNN) learning paradigm.
+PNN introduces a neutral class, encompassing intricate items that are
+challenging to categorize directly as positive or negative samples. By training
+a model based on this triple-wise partial ranking, PNN offers a promising
+solution to learning complex user preferences. Through theoretical analysis, we
+connect PNN to one-way partial AUC (OPAUC) to validate its efficacy.
+Implementing the PNN paradigm is, however, technically challenging because: (1)
+it is difficult to classify unlabeled data into neutral or negative in the
+absence of supervised signals; (2) there does not exist any loss function that
+can handle set-level triple-wise ranking relationships. To address these
+challenges, we propose a semi-supervised learning method coupled with a
+user-aware attention model for knowledge acquisition and classification
+refinement. Additionally, a novel loss function with a two-step centroid
+ranking approach enables handling set-level rankings. Extensive experiments on
+four real-world datasets demonstrate that, when combined with PNN, a wide range
+of representative CF models can consistently and significantly boost their
+performance. Even with a simple matrix factorization, PNN can achieve
+comparable performance to sophisticated graph neural networks.
+
+
+
+
+
+
+
+ ☆ From Pairwise to Ranking: Climbing the Ladder to Ideal Collaborative
+ Filtering with Pseudo-Ranking
+
+
+
+
+
+
+
+
+ Yuhan Zhao, Rui Chen, Li Chen, Shuang Zhang, Qilong Han, Hongtao Song
+
+
+ Intuitively, an ideal collaborative filtering (CF) model should learn from
+users' full rankings over all items to make optimal top-K recommendations. Due
+to the absence of such full rankings in practice, most CF models rely on
+pairwise loss functions to approximate full rankings, resulting in an immense
+performance gap. In this paper, we provide a novel analysis using the multiple
+ordinal classification concept to reveal the inevitable gap between a pairwise
+approximation and the ideal case. However, bridging the gap in practice
+encounters two formidable challenges: (1) none of the real-world datasets
+contains full ranking information; (2) there does not exist a loss function
+that is capable of consuming ranking information. To overcome these challenges,
+we propose a pseudo-ranking paradigm (PRP) that addresses the lack of ranking
+information by introducing pseudo-rankings supervised by an original noise
+injection mechanism. Additionally, we put forward a new ranking loss function
+designed to handle ranking information effectively. To ensure our method's
+robustness against potential inaccuracies in pseudo-rankings, we equip the
+ranking loss function with a gradient-based confidence mechanism to detect and
+mitigate abnormal gradients. Extensive experiments on four real-world datasets
+demonstrate that PRP significantly outperforms state-of-the-art methods.
+
+
+
+
+
+
+
+
+ Tuan-Nghia Bui, Huy-Son Nguyen, Cam-Van Nguyen Thi, Hoang-Quynh Le, Duc-Trong Le
+
+
+ Bundle recommendation aims to suggest a set of interconnected items to users.
+However, diverse interaction types and sparse interaction matrices often pose
+challenges for previous approaches in accurately predicting user-bundle
+adoptions. Inspired by the distant supervision strategy and generative
+paradigm, we propose BRIDGE, a novel framework for bundle recommendation. It
+consists of two main components, namely the correlation-based item clustering
+and the pseudo bundle generation modules. Inspired by the distant supervision
+approach, the former is to generate more auxiliary information, e.g.,
+instructive item clusters, for training without using external data. This
+information is subsequently aggregated with collaborative signals from user
+historical interactions to create pseudo `ideal' bundles. This capability
+allows BRIDGE to explore all aspects of bundles, rather than being limited to
+existing real-world bundles. It effectively bridges the gap between user
+imagination and predefined bundles, hence improving the bundle recommendation
+performance. Experimental results validate the superiority of our models over
+state-of-the-art ranking-based methods across five benchmark datasets.
+
+
+ The item cold-start problem is crucial for online recommender systems, as the
+success of the cold-start phase determines whether items can transition into
+popular ones. Prompt learning, a powerful technique used in natural language
+processing (NLP) to address zero- or few-shot problems, has been adapted for
+recommender systems to tackle similar challenges. However, existing methods
+typically rely on content-based properties or text descriptions for prompting,
+which we argue may be suboptimal for cold-start recommendations due to 1)
+semantic gaps with recommender tasks, and 2) model bias caused by warm-up items
+contributing most of the positive feedback to the model, which is the core of the
+cold-start problem and hinders the recommender quality on cold-start items. We
+propose to leverage high-value positive feedback, termed pinnacle feedback, as
+prompt information, to simultaneously resolve the above two problems. We
+experimentally prove that compared to the content description proposed in
+existing works, the positive feedback is more suitable to serve as prompt
+information by bridging the semantic gaps. Besides, we propose item-wise
+personalized prompt networks to encode pinnacle feedback to relieve the model
+bias caused by the positive feedback dominance problem. Extensive experiments on four
+real-world datasets demonstrate the superiority of our model over
+state-of-the-art methods. Moreover, PROMO has been successfully deployed on a
+popular short-video sharing platform, a billion-user scale commercial
+short-video application, achieving remarkable performance gains across various
+commercial metrics within cold-start scenarios.
+
+
+
+
+
+
+
+ ♻ ☆ Deep Adaptive Interest Network: Personalized Recommendation with
+ Context-Aware Learning
+
+
+
+
+
+
+
+
+ Shuaishuai Huang, Haowei Yang, You Yao, Xueting Lin, Yuming Tu
+
+
+ In personalized recommendation systems, accurately capturing users' evolving
+interests and combining them with contextual information is a critical research
+area. This paper proposes a novel model called the Deep Adaptive Interest
+Network (DAIN), which dynamically models users' interests while incorporating
+context-aware learning mechanisms to achieve precise and adaptive personalized
+recommendations. DAIN leverages deep learning techniques to build an adaptive
+interest network structure that can capture users' interest changes in
+real-time while further optimizing recommendation results by integrating
+contextual information. Experiments conducted on several public datasets
+demonstrate that DAIN excels in both recommendation performance and
+computational efficiency. This research not only provides a new solution for
+personalized recommendation systems but also offers fresh insights into the
+application of context-aware learning in recommendation systems.
+
+
+
+
+
+
+
+ ♻ ☆ TableRAG: Million-Token Table Understanding with Language Models NeurIPS 2024
+
+
+ Recent advancements in language models (LMs) have notably enhanced their
+ability to reason with tabular data, primarily through program-aided mechanisms
+that manipulate and analyze tables. However, these methods often require the
+entire table as input, leading to scalability challenges due to the positional
+bias or context length constraints. In response to these challenges, we
+introduce TableRAG, a Retrieval-Augmented Generation (RAG) framework
+specifically designed for LM-based table understanding. TableRAG leverages
+query expansion combined with schema and cell retrieval to pinpoint crucial
+information before providing it to the LMs. This enables more efficient data
+encoding and precise retrieval, significantly reducing prompt lengths and
+mitigating information loss. We have developed two new million-token benchmarks
+from the Arcade and BIRD-SQL datasets to thoroughly evaluate TableRAG's
+effectiveness at scale. Our results demonstrate that TableRAG's retrieval
+design achieves the highest retrieval quality, leading to the new
+state-of-the-art performance on large-scale table understanding.
+
+
+
+ comment: Accepted to NeurIPS 2024
+
+
+
+
+
+
+ ♻ ☆ RAGONITE: Iterative Retrieval on Induced Databases and Verbalized RDF
+ for Conversational QA over KGs with RAG
+
+
+
+
+
+
+
+
+ Rishiraj Saha Roy, Chris Hinze, Joel Schlotthauer, Farzad Naderi, Viktor Hangya, Andreas Foltyn, Luzian Hahn, Fabian Kuech
+
+
+ Conversational question answering (ConvQA) is a convenient means of searching
+over RDF knowledge graphs (KGs), where a prevalent approach is to translate
+natural language questions to SPARQL queries. However, SPARQL has certain
+shortcomings: (i) it is brittle for complex intents and conversational
+questions, and (ii) it is not suitable for more abstract needs. Instead, we
+propose a novel two-pronged system where we fuse: (i) SQL-query results over a
+database automatically derived from the KG, and (ii) text-search results over
+verbalizations of KG facts. Our pipeline supports iterative retrieval: when the
+results of any branch are found to be unsatisfactory, the system can
+automatically opt for further rounds. We put everything together in a retrieval
+augmented generation (RAG) setup, where an LLM generates a coherent response
+from accumulated search results. We demonstrate the superiority of our proposed
+system over several baselines on a knowledge graph of BMW automobiles.
+
+
+
+ comment: Accepted at BTW 2025, 10 pages
+
+
+
+
+
+
+ ♻ ☆ LLMTreeRec: Unleashing the Power of Large Language Models for Cold-Start
+ Recommendations
+
+
+ The lack of training data gives rise to the system cold-start problem in
+recommendation systems, making them struggle to provide effective
+recommendations. To address this problem, Large Language Models (LLMs) can
+model recommendation tasks as language analysis tasks and provide zero-shot
+results based on their vast open-world knowledge. However, the large scale of
+the item corpus poses a challenge to LLMs, leading to substantial token
+consumption that makes it impractical to deploy in real-world recommendation
+systems. To tackle this challenge, we introduce a tree-based LLM recommendation
+framework LLMTreeRec, which structures all items into an item tree to improve
+the efficiency of LLM's item retrieval. LLMTreeRec achieves state-of-the-art
+performance under the system cold-start setting in two widely used datasets,
+which is even competitive with conventional deep recommendation systems that
+use substantial training data. Furthermore, LLMTreeRec outperforms the baseline
+model in A/B testing on Huawei industrial systems. Consequently, LLMTreeRec
+demonstrates its effectiveness as an industry-friendly solution that has been
+successfully deployed online. Our code is available at:
+https://github.com/Applied-Machine-Learning-Lab/LLMTreeRec.
+
+
+ Long-form document matching aims to judge the relevance between two documents
+and has been applied to various scenarios. Most existing works utilize
+hierarchical or long context models to process documents, which achieve coarse
+understanding but may ignore details. Some researchers construct a document
+view with similar sentences about aligned document subtopics to focus on
+detailed matching signals. However, a long document generally contains multiple
+subtopics. The matching signals from multiple topics are heterogeneous.
+Considering only the homologous aligned subtopics may not be representative
+enough and may cause biased modeling. In this paper, we introduce a new
+framework to model representative matching signals. First, we propose to
+capture various matching signals through subtopics of document pairs. Next, we
+construct multiple document views based on subtopics to cover heterogeneous and
+valuable details. However, existing spatial aggregation methods like attention,
+which integrate all these views simultaneously, struggle to integrate
+heterogeneous information. Instead, we propose temporal aggregation, which
+effectively integrates different views gradually as the training progresses.
+Experimental results show that our learning framework is effective on several
+document-matching tasks, including news duplication and legal case retrieval.
+
+
+
+
+
+
+
+
+
+
+ Machine Learning 150
+
+
+
+
+
+ ☆ Decentralized Intelligence in GameFi: Embodied AI Agents and the
+ Convergence of DeFi and Virtual Ecosystems
+
+
+
+
+
+
+
+
+ Fernando Jia, Jade Zheng, Florence Li
+
+
+ In the rapidly evolving landscape of GameFi, a fusion of gaming and
+decentralized finance (DeFi), there exists a critical need to enhance player
+engagement and economic interaction within gaming ecosystems. Our GameFi
+ecosystem aims to fundamentally transform this landscape by integrating
+advanced embodied AI agents into GameFi platforms. These AI agents, developed
+using cutting-edge large language models (LLMs), such as GPT-4 and Claude AI,
+are capable of proactive, adaptive, and contextually rich interactions with
+players. By going beyond traditional scripted responses, these agents become
+integral participants in the game's narrative and economic systems, directly
+influencing player strategies and in-game economies. We address the limitations
+of current GameFi platforms, which often lack immersive AI interactions and
+mechanisms for community engagement or creator monetization. Through the deep
+integration of AI agents with blockchain technology, we establish a
+consensus-driven, decentralized GameFi ecosystem. This ecosystem empowers
+creators to monetize their contributions and fosters democratic collaboration
+among players and creators. Furthermore, by embedding DeFi mechanisms into the
+gaming experience, we enhance economic participation and provide new
+opportunities for financial interactions within the game. Our approach enhances
+player immersion and retention and advances the GameFi ecosystem by bridging
+traditional gaming with Web3 technologies. By integrating sophisticated AI and
+DeFi elements, we contribute to the development of more engaging, economically
+robust, and community-centric gaming environments. This project represents a
+significant advancement in the state-of-the-art in GameFi, offering insights
+and methodologies that can be applied throughout the gaming industry.
+
+
+
+ comment: 11 pages, 4 figures
+
+
+
+
+
+
+ ☆ Structure Learning in Gaussian Graphical Models from Glauber Dynamics
+
+
+ Gaussian graphical model selection is an important paradigm with numerous
+applications, including biological network modeling, financial network
+modeling, and social network analysis. Traditional approaches assume access to
+independent and identically distributed (i.i.d.) samples, which is often
+impractical in real-world scenarios. In this paper, we address Gaussian
+graphical model selection under observations from a more realistic dependent
+stochastic process known as Glauber dynamics. Glauber dynamics, also called the
+Gibbs sampler, is a Markov chain that sequentially updates the variables of the
+underlying model based on the statistics of the remaining model. Such models,
+aside from frequently being employed to generate samples from complex
+multivariate distributions, naturally arise in various settings, such as
+opinion consensus in social networks and clearing/stock-price dynamics in
+financial networks.
+ In contrast to the extensive body of existing work, we present the first
+algorithm for Gaussian graphical model selection when data are sampled
+according to the Glauber dynamics. We provide theoretical guarantees on the
+computational and statistical complexity of the proposed algorithm's structure
+learning performance. Additionally, we provide information-theoretic lower
+bounds on the statistical complexity and show that our algorithm is nearly
+minimax optimal for a broad class of problems.
+
+
+
+
+
+
+
+ ☆ Resolution-Robust 3D MRI Reconstruction with 2D Diffusion Priors:
+ Diverse-Resolution Training Outperforms Interpolation
+
+
+ Deep learning-based 3D imaging, in particular magnetic resonance imaging
+(MRI), is challenging because of limited availability of 3D training data.
+Therefore, 2D diffusion models trained on 2D slices are starting to be
+leveraged for 3D MRI reconstruction. However, as we show in this paper,
+existing methods pertain to a fixed voxel size, and performance degrades when
+the voxel size is varied, as is often the case in clinical practice. In this
+paper, we propose and study several approaches for resolution-robust 3D MRI
+reconstruction with 2D diffusion priors. As a result of this investigation, we
+obtain a simple resolution-robust variational 3D reconstruction approach based
+on diffusion-guided regularization of randomly sampled 2D slices. This method
+provides competitive reconstruction quality compared to posterior sampling
+baselines. Towards resolving the sensitivity to resolution shifts, we
+investigate state-of-the-art model-based approaches including Gaussian
+splatting, neural representations, and infinite-dimensional diffusion models,
+as well as a simple data-centric approach of training the diffusion model on
+several resolutions. Our experiments demonstrate that the model-based
+approaches fail to close the performance gap in 3D MRI. In contrast, the
+data-centric approach of training the diffusion model on various resolutions
+effectively provides a resolution-robust method without compromising accuracy.
+
+
+
+
+
+
+
+ ☆ Exploring Embedding Priors in Prompt-Tuning for Improved
+ Interpretability and Control
+
+
+ Prompt-Tuning is an efficient method for adapting pre-trained language models
+to new tasks with minimal computational overhead by modifying prompt
+embeddings. In this work, we investigate how crucial the phenomenon of
+embedding collapse, frequently observed in Prompt-Tuning, is for the final
+performance of the model. To address this question, we designed embedding
+priors and compared them with posteriors of the converged Soft and Deep
+Prompt-Tuning methods. Our findings suggest that priors strongly affect the
+position of the tuned embeddings, and models can effectively work with
+embeddings from different parts of activation spaces, including completely new
+regions. As the final Prompt-Tuning capabilities are limited, we hypothesize
+that controllable Prompt-Tuning posteriors may serve as a good starting point
+for tasks such as chain-of-thought (CoT) distillation. Our experiments also
+show that generated trajectories are not localized in the activation space of
+the models. However, there are distinct clusters of activations for distant
+tasks (e.g., NLP and arithmetic), while activations between NLP tasks (e.g.,
+Question-Answering and MLM) lie in the same cluster. These observations raise
+questions about the importance of a single activation cluster for the
+generalization abilities of large language models.
+
+
+
+
+
+
+
+
+ Oliver Cassidy, Marta Andronic, Samuel Coward, George A. Constantinides
+
+
+ Lookup tables (LUTs) are frequently used to efficiently store arrays of
+precomputed values for complex mathematical computations. When used in the
+context of neural networks, these functions exhibit a lack of recognizable
+patterns which presents an unusual challenge for conventional logic synthesis
+techniques. Several approaches are known to break down a single large lookup
+table into multiple smaller ones that can be recombined. Traditional methods,
+such as plain tabulation, piecewise linear approximation, and multipartite
+table methods, often yield inefficient hardware solutions when applied to
+LUT-based NNs.
+ This paper introduces ReducedLUT, a novel method to reduce the footprint of
+the LUTs by injecting don't cares into the compression process. This additional
+freedom introduces more self-similarities which can be exploited using known
+decomposition techniques. We then demonstrate a particular application to
+machine learning; by replacing unobserved patterns within the training data of
+neural network models with don't cares, we enable greater compression with
+minimal model accuracy degradation. In practice, we achieve up to $1.63\times$
+reduction in Physical LUT utilization, with a test accuracy drop of no more
+than $0.01$ accuracy points.
+
+
+
+
+
+
+
+
+ Co Tran, Quoc-Bao Tran, Hy Truong Son, Thang N Dinh
+
+
+ Hard combinatorial optimization problems, often mapped to Ising models,
+promise potential solutions with quantum advantage but are constrained by
+limited qubit counts in near-term devices. We present an innovative
+quantum-inspired framework that dynamically compresses large Ising models to
+fit available quantum hardware of different sizes. Thus, we aim to bridge the
+gap between large-scale optimization and current hardware capabilities. Our
+method leverages a physics-inspired GNN architecture to capture complex
+interactions in Ising models and accurately predict alignments among
+neighboring spins (aka qubits) at ground states. By progressively merging such
+aligned spins, we can reduce the model size while preserving the underlying
+optimization structure. It also provides a natural trade-off between the
+solution quality and size reduction, meeting different hardware constraints of
+quantum computing devices. Extensive numerical studies on Ising instances of
+diverse topologies show that our method can reduce instance size at multiple
+levels with virtually no losses in solution quality on the latest D-Wave
+quantum annealers.
+
+
+ The problem of evaluating the effectiveness of a treatment or policy commonly
+appears in causal inference applications under network interference. In this
+paper, we suggest a new method, high-dimensional network causal inference
+(HNCI), that provides both a valid confidence interval for the average direct
+treatment effect on the treated (ADET) and a valid confidence set for the
+neighborhood size of the interference effect. We exploit the model setting in
+Belloni et al. (2022) and allow a certain type of heterogeneity in node
+interference neighborhood sizes. We propose a linear regression formulation of
+potential outcomes, where the regression coefficients correspond to the
+underlying true interference function values of nodes and exhibit a latent
+homogeneous structure. Such a formulation allows us to leverage existing
+literature from linear regression and homogeneity pursuit to conduct valid
+statistical inferences with theoretical guarantees. The resulting confidence
+intervals for the ADET are formally justified through asymptotic normalities
+with estimable variances. We further provide the confidence set for the
+neighborhood size with theoretical guarantees exploiting the repro samples
+approach. The practical utilities of the newly suggested methods are
+demonstrated through simulation and real data examples.
+
+
+ Aircraft design optimization traditionally relies on computationally
+expensive simulation techniques such as Finite Element Method (FEM) and Finite
+Volume Method (FVM), which, while accurate, can significantly slow down the
+design iteration process. The challenge lies in reducing the computational
+complexity while maintaining high accuracy for quick evaluations of multiple
+design alternatives. This research explores advanced methods, including
+surrogate models, reduced-order models (ROM), and multi-fidelity machine
+learning techniques, to achieve more efficient aircraft design evaluations.
+Specifically, the study investigates the application of Multi-fidelity
+Physics-Informed Neural Networks (MPINN) and autoencoders for manifold
+alignment, alongside the potential of Generative Adversarial Networks (GANs)
+for refining design geometries. Through a proof-of-concept task, the research
+demonstrates the ability to predict high-fidelity results from low-fidelity
+simulations, offering a path toward faster and more cost-effective aircraft
+design iterations.
+
+
+
+ comment: 7 pages, 3 figures
+
+
+
+
+
+
+ ☆ FedVCK: Non-IID Robust and Communication-Efficient Federated Learning
+ via Valuable Condensed Knowledge for Medical Image Analysis AAAI 2025
+
+
+ Federated learning has become a promising solution for collaboration among
+medical institutions. However, the data owned by each institution are highly
+heterogeneous, and their distribution is always non-independent and identically
+distributed (non-IID), resulting in client drift and unsatisfactory
+performance. Although existing federated learning methods attempt to solve
+the non-IID problem, they show only marginal advantages and rely on frequent
+communication, which incurs high costs and privacy concerns. In this paper,
+we propose a novel federated learning method: Federated learning via
+Valuable Condensed Knowledge (FedVCK). We enhance
+the quality of condensed knowledge and select the most necessary knowledge
+guided by models, to tackle the non-IID problem within limited communication
+budgets effectively. Specifically, on the client side, we condense the
+knowledge of each client into a small dataset and further enhance the
+condensation procedure with latent distribution constraints, facilitating the
+effective capture of high-quality knowledge. During each round, we specifically
+target and condense knowledge that has not been assimilated by the current
+model, thereby preventing unnecessary repetition of homogeneous knowledge and
+minimizing the frequency of communications required. On the server side, we
+propose relational supervised contrastive learning to provide more supervision
+signals to aid the global model updating. Comprehensive experiments across
+various medical tasks show that FedVCK can outperform state-of-the-art methods,
+demonstrating that it's non-IID robust and communication-efficient.
+
+
+ Reasoning is critical for large language models (LLMs) to excel in a wide
+range of tasks. While methods like Chain-of-Thought (CoT) reasoning enhance LLM
+performance by decomposing problems into intermediate steps, they also incur
+significant overhead in token usage, leading to increased costs. We find that
+the reasoning process of current LLMs is unnecessarily lengthy and it can be
+compressed by including a reasonable token budget in the prompt, but the choice
+of token budget plays a crucial role in the actual compression effectiveness.
+We then propose a token-budget-aware LLM reasoning framework, which dynamically
+estimates token budgets for different problems based on reasoning complexity
+and uses the estimated token budgets to guide the reasoning process.
+Experiments show that our method effectively reduces token costs in CoT
+reasoning with only a slight performance reduction, offering a practical
+solution to balance efficiency and accuracy in LLM reasoning. Code:
+https://github.com/GeniusHTX/TALE.
+
+
+
+
+
+
+
+ ☆ Consistency Checks for Language Model Forecasters ICLR 2025
+
+
+ Forecasting is a task that is difficult to evaluate: the ground truth can
+only be known in the future. Recent work showing LLM forecasters rapidly
+approaching human-level performance begs the question: how can we benchmark and
+evaluate these forecasters instantaneously? Following the consistency check
+framework, we measure the performance of forecasters in terms of the
+consistency of their predictions on different logically-related questions. We
+propose a new, general consistency metric based on arbitrage: for example, if a
+forecasting AI illogically predicts that both the Democratic and Republican
+parties have 60% probability of winning the 2024 US presidential election, an
+arbitrageur can trade against the forecaster's predictions and make a profit.
+We build an automated evaluation system that generates a set of base questions,
+instantiates consistency checks from these questions, elicits the predictions
+of the forecaster, and measures the consistency of the predictions. We then
+build a standard, proper-scoring-rule forecasting benchmark, and show that our
+(instantaneous) consistency metrics correlate with LLM forecasters' ground
+truth Brier scores (which are only known in the future). We also release a
+consistency benchmark that resolves in 2028, providing a long-term evaluation
+tool for forecasting.
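The arbitrage idea in the abstract above can be illustrated with a deliberately simplified sketch. It assumes the outcomes are mutually exclusive and exhaustive, and that an arbitrageur can buy or sell a $1 contract on each outcome at the forecaster's stated probability. The function name `guaranteed_profit` is hypothetical, and this is only the intuition behind an arbitrage-based score, not the metric defined in the paper.

```python
def guaranteed_profit(probabilities):
    """Risk-free profit per $1 contract against a set of quoted probabilities.

    Assumption (not from the paper): exactly one of the outcomes occurs,
    and each probability is the price of a contract paying $1 on that outcome.
    """
    for p in probabilities:
        if not 0.0 <= p <= 1.0:
            raise ValueError("each probability must lie in [0, 1]")
    total = sum(probabilities)
    # total > 1: sell one contract on every outcome, collect `total`, pay out 1.
    # total < 1: buy one contract on every outcome, pay `total`, receive 1.
    # total == 1: the forecasts are coherent and no arbitrage exists.
    return abs(total - 1.0)


# The two-party example: both outcomes quoted at 60%.
inconsistent = guaranteed_profit([0.6, 0.6])
consistent = guaranteed_profit([0.5, 0.5])
```

A coherent forecaster yields zero profit, so a larger value indicates a larger logical inconsistency among the related predictions.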
+
+
+ Recent advances in statistical learning theory have revealed profound
+connections between mutual information (MI) bounds, PAC-Bayesian theory, and
+Bayesian nonparametrics. This work introduces a novel mutual information bound
+for statistical models. The derived bound has wide-ranging applications in
+statistical inference. It yields improved contraction rates for fractional
+posteriors in Bayesian nonparametrics. It can also be used to study a wide
+range of estimation methods, such as variational inference or Maximum
+Likelihood Estimation (MLE). By bridging these diverse areas, this work
+advances our understanding of the fundamental limits of statistical inference
+and the role of information in learning from data. We hope that these results
+will not only clarify connections between statistical inference and information
+theory but also help to develop a new toolbox to study a wide range of
+estimators.
+
+
+
+
+
+
+
+ ☆ Graph Structure Learning for Spatial-Temporal Imputation: Adapting to
+ Node and Feature Scales AAAI 2025
+
+
+ Spatial-temporal data collected across different geographic locations often
+suffer from missing values, posing challenges to data analysis. Existing
+methods primarily leverage fixed spatial graphs to impute missing values, which
+implicitly assume that the spatial relationship is roughly the same for all
+features across different locations. However, they may overlook the different
+spatial relationships of diverse features recorded by sensors in different
+locations. To address this, we introduce the multi-scale Graph Structure
+Learning framework for spatial-temporal Imputation (GSLI) that dynamically
+adapts to the heterogeneous spatial correlations. Our framework encompasses
+node-scale graph structure learning to cater to the distinct global spatial
+correlations of different features, and feature-scale graph structure learning
+to unveil common spatial correlation across features within all stations.
+Integrated with prominence modeling, our framework emphasizes nodes and
+features with greater significance in the imputation process. Furthermore, GSLI
+incorporates cross-feature and cross-temporal representation learning to
+capture spatial-temporal dependencies. Evaluated on six real incomplete
+spatial-temporal datasets, GSLI showcases the improvement in data imputation.
+
+
+
+ comment: This paper has been accepted as a full paper at AAAI 2025
+
+ Graph convolutional networks (GCNs) are popular for building machine-learning
+applications for graph-structured data. This widespread adoption led to the
+development of specialized GCN hardware accelerators. In this work, we address
+a key architectural challenge for GCN accelerators: how to detect errors in GCN
+computations arising from random hardware faults with the least computation
+cost. Each GCN layer performs a graph convolution, mathematically equivalent to
+multiplying three matrices, computed through two separate matrix
+multiplications. Existing Algorithm-based Fault Tolerance (ABFT) techniques can
+check the results of individual matrix multiplications. However, for a GCN
+layer, this check should be performed twice. To avoid this overhead, this work
+introduces GCN-ABFT that directly calculates a checksum for the entire
+three-matrix product within a single GCN layer, providing a cost-effective
+approach for error detection in GCN accelerators. Experimental results
+demonstrate that GCN-ABFT reduces the number of operations needed for checksum
+computation by over 21% on average for representative GCN applications. These
+savings are achieved without sacrificing fault-detection accuracy, as evidenced
+by the presented fault-injection analysis.
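The saving described above rests on a standard checksum identity: for Y = A X W, the sum of all entries of Y equals (1^T A) X (W 1), which needs only vector products. The sketch below shows that identity in plain Python with small integer matrices. It is a generic illustration of one fused check over a three-matrix product, not the paper's GCN-ABFT implementation, and the helper names are hypothetical.

```python
def matmul(p, q):
    """Plain-Python matrix product of two matrices stored as lists of rows."""
    inner, cols = len(q), len(q[0])
    return [[sum(row[k] * q[k][j] for k in range(inner)) for j in range(cols)]
            for row in p]


def fused_checksum(a, x, w):
    """Return 1^T (A X W) 1 without forming the product A X W."""
    col_sums_a = [sum(row[j] for row in a) for j in range(len(a[0]))]      # 1^T A
    row_sums_w = [sum(row) for row in w]                                   # W 1
    x_w1 = [sum(xv * wv for xv, wv in zip(row, row_sums_w)) for row in x]  # X (W 1)
    return sum(av * xv for av, xv in zip(col_sums_a, x_w1))


def output_sum(y):
    """Sum of all entries of the computed output matrix."""
    return sum(sum(row) for row in y)


# Toy layer: A is a 3x3 adjacency-like matrix, X holds features, W holds weights.
A = [[1, 1, 0], [1, 1, 1], [0, 1, 1]]
X = [[1, 2], [3, 4], [5, 6]]
W = [[1, 0], [2, 1]]

Y = matmul(A, matmul(X, W))
expected = fused_checksum(A, X, W)
fault_free = output_sum(Y) == expected

# Emulate a hardware fault by corrupting one output entry.
Y_faulty = [row[:] for row in Y]
Y_faulty[1][0] += 4
fault_detected = output_sum(Y_faulty) != expected
```

Integer inputs keep the comparison exact; with floating-point data the equality test would need a tolerance.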
+
+
+
+ comment: Accepted for publication at IEEE Transactions on Computer-Aided
+ Design of Integrated Circuits and Systems (TCAD)
+
+
+
+
+
+
+ ☆ Characterizations of Language Generation With Breadth
+
+
+ We study language generation in the limit, introduced by Kleinberg and
+Mullainathan [KM24], building on classical works of Gold [Gol67] and Angluin
+[Ang79]. [KM24] proposed an algorithm that generates strings from any countable
+language collection in the limit. While their algorithm eventually outputs
+strings from the target language $K$, it sacrifices breadth, i.e., the ability
+to generate all strings in $K$. A key open question in [KM24] is whether this
+trade-off between consistency and breadth is inherent.
+ Recent works proposed different notions of consistent generation with
+breadth. Kalavasis, Mehrotra, and Velegkas [KVM24] introduced three
+definitions: generation with exact breadth, approximate breadth, and
+unambiguous generation. Concurrently and independently, Charikar and Pabbaraju
+[CP24a] proposed exhaustive generation. Both works examined when generation
+with these notions of breadth is possible.
+ Building on [CP24a, KVM24], we fully characterize language generation for
+these notions and their natural combinations. For exact breadth, we provide an
+unconditional lower bound, removing a technical condition from [KVM24] and
+extending the result of [CP24a] that holds for specific collections of
+languages. We show that generation with exact breadth is characterized by
+Angluin's condition for identification. We further introduce a weaker version
+of Angluin's condition that tightly characterizes both approximate breadth and
+exhaustive generation, proving their equivalence. Additionally, we show that
+unambiguous generation is also characterized by Angluin's condition as a
+special case of a broader result. Finally, we strengthen [KVM24] by giving
+unconditional lower bounds for stable generators, showing that Angluin's
+condition characterizes the previous breadth notions for stable generators.
+This shows a separation between stable and unstable generation with approximate
+breadth.
+
+
+
+ comment: Abstract shortened to fix arXiv limit
+
+
+
+
+
+
+ ☆ Accelerating process control and optimization via machine learning: A
+ review
+
+
+ Process control and optimization have been widely used to solve
+decision-making problems in chemical engineering applications. However,
+identifying and tuning the best solution algorithm is challenging and
+time-consuming. Machine learning tools can be used to automate these steps by
+learning the behavior of a numerical solver from data. In this paper, we
+discuss recent advances in (i) the representation of decision-making problems
+for machine learning tasks, (ii) algorithm selection, and (iii) algorithm
+configuration for monolithic and decomposition-based algorithms. Finally, we
+discuss open problems related to the application of machine learning for
+accelerating process optimization and control.
+
+
+ Bilevel optimization, a hierarchical mathematical framework where one
+optimization problem is nested within another, has emerged as a powerful tool
+for modeling complex decision-making processes in various fields such as
+economics, engineering, and machine learning. This paper focuses on bilevel
+optimization where both upper-level and lower-level functions are black boxes
+and expensive to evaluate. We propose a Bayesian Optimization framework that
+models the upper and lower-level functions as Gaussian processes over the
+combined space of upper and lower-level decisions, allowing us to exploit
+knowledge transfer between different sub-problems. Additionally, we propose a
+novel acquisition function for this model. Our experimental results demonstrate
+that the proposed algorithm is highly sample-efficient and outperforms existing
+methods in finding high-quality solutions.
+
+
+
+
+
+
+
+ ☆ Subsampling, aligning, and averaging to find circular coordinates in
+ recurrent time series
+
+
+
+
+
+
+
+
+ Andrew J. Blumberg, Mathieu Carrière, Jun Hou Fung, Michael A. Mandell
+
+
+ We introduce a new algorithm for finding robust circular coordinates on data
+that is expected to exhibit recurrence, such as that which appears in neuronal
+recordings of C. elegans. Techniques exist to create circular coordinates on a
+simplicial complex from a dimension 1 cohomology class, and these can be
+applied to the Rips complex of a dataset when it has a prominent class in its
+dimension 1 cohomology. However, it is known this approach is extremely
+sensitive to uneven sampling density.
+ Our algorithm comes with a new method to correct for uneven sampling density,
+adapting our prior work on averaging coordinates in manifold learning. We use
+rejection sampling to correct for inhomogeneous sampling and then apply
+Procrustes matching to align and average the subsamples. In addition to
+providing a more robust coordinate than other approaches, this subsampling and
+averaging approach has better efficiency.
+ We validate our technique on both synthetic data sets and neuronal activity
+recordings. Our results reveal a topological model of neuronal trajectories for
+C. elegans that is constructed from loops in which different regions of the
+brain state space can be mapped to specific and interpretable macroscopic
+behaviors in the worm.
+
+
+
+
+
+
+
+ ☆ FedGIG: Graph Inversion from Gradient in Federated Learning
+
+
+ Recent studies have shown that Federated learning (FL) is vulnerable to
+Gradient Inversion Attacks (GIA), which can recover private training data from
+shared gradients. However, existing methods are designed for dense, continuous
+data such as images or vectorized texts, and cannot be directly applied to
+sparse and discrete graph data. This paper first explores GIA's impact on
+Federated Graph Learning (FGL) and introduces Graph Inversion from Gradient in
+Federated Learning (FedGIG), a novel GIA method specifically designed for
+graph-structured data. FedGIG includes the adjacency matrix constraining
+module, which ensures the sparsity and discreteness of the reconstructed graph
+data, and the subgraph reconstruction module, which is designed to complete
+missing common subgraph structures. Extensive experiments on molecular datasets
+demonstrate FedGIG's superior accuracy over existing GIA techniques.
+
+
+
+
+
+
+
+ ☆ An Empirical Analysis of Federated Learning Models Subject to
+ Label-Flipping Adversarial Attack
+
+
+ In this paper, we empirically analyze adversarial attacks on selected
+federated learning models. The specific learning models considered are
+Multinomial Logistic Regression (MLR), Support Vector Classifier (SVC),
+Multilayer Perceptron (MLP), Convolutional Neural Network (CNN), Random Forest,
+XGBoost, and Long Short-Term Memory
+(LSTM). For each model, we simulate label-flipping attacks, experimenting
+extensively with 10 federated clients and 100 federated clients. We vary the
+percentage of adversarial clients from 10% to 100% and, simultaneously, the
+percentage of labels flipped by each adversarial client is also varied from 10%
+to 100%. Among other results, we find that models differ in their inherent
+robustness to the two vectors in our label-flipping attack, i.e., the
+percentage of adversarial clients, and the percentage of labels flipped by each
+adversarial client. We discuss the potential practical implications of our
+results.
+
+
+
+
+
+
+
+ ☆ VORTEX: A Spatial Computing Framework for Optimized Drone Telemetry
+ Extraction from First-Person View Flight Data
+
+
+
+
+
+
+
+
+ James E. Gallagher, Edward J. Oughton
+
+
+ This paper presents the Visual Optical Recognition Telemetry EXtraction
+(VORTEX) system for extracting and analyzing drone telemetry data from First
+Person View (FPV) Uncrewed Aerial System (UAS) footage. VORTEX employs MMOCR, a
+PyTorch-based Optical Character Recognition (OCR) toolbox, to extract telemetry
+variables from drone Heads Up Display (HUD) recordings, utilizing advanced
+image preprocessing techniques, including CLAHE enhancement and adaptive
+thresholding. The study optimizes spatial accuracy and computational efficiency
+through systematic investigation of temporal sampling rates (1s, 5s, 10s, 15s,
+20s) and coordinate processing methods. Results demonstrate that the 5-second
+sampling rate, utilizing 4.07% of available frames, provides the optimal
+balance with a point retention rate of 64% and mean speed accuracy within 4.2%
+of the 1-second baseline while reducing computational overhead by 80.5%.
+Comparative analysis of coordinate processing methods reveals that while UTM
+Zone 33N projection and Haversine calculations provide consistently similar
+results (within 0.1% difference), raw WGS84 coordinates underestimate distances
+by 15-30% and speeds by 20-35%. Altitude measurements showed unexpected
+resilience to sampling rate variations, with only 2.1% variation across all
+intervals. This research is the first of its kind, providing quantitative
+benchmarks for establishing a robust framework for drone telemetry extraction
+and analysis using open-source tools and spatial libraries.
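For reference, the Haversine great-circle distance mentioned above can be computed as follows. This is a minimal sketch assuming a spherical Earth with a mean radius of 6,371 km. The function name and the constant are illustrative and are not taken from VORTEX.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # assumed mean Earth radius in metres


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two WGS84 points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    # Haversine formula: a is the squared half-chord length between the points.
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


# One degree of longitude along the equator.
one_degree_at_equator = haversine_m(0.0, 0.0, 0.0, 1.0)
```

Dividing the distance between consecutive telemetry points by the sampling interval gives a speed estimate, which is why the distance method affects speed accuracy as well.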
+
+
+
+
+
+
+
+ ☆ An Overview and Discussion of the Suitability of Existing Speech
+ Datasets to Train Machine Learning Models for Collective Problem Solving
+
+
+ This report characterized the suitability of existing datasets for devising
+new Machine Learning models, decision making methods, and analysis algorithms
+to improve Collaborative Problem Solving and then enumerated requirements for
+future datasets to be devised. Problem solving was assumed to be performed in
+teams of about three or four members who talked to each other. A dataset
+consists of the speech recordings of such teams. The characterization
+methodology was based on metrics that capture cognitive, social, and emotional
+activities and situations. The report presented the analysis of a large group
+of datasets developed for Spoken Language Understanding, a research area with
+some similarity to Collaborative Problem Solving.
+
+
+ Federated learning (FL) is a promising paradigm in distributed learning while
+preserving the privacy of users. However, the increasing size of recent models
+makes it unaffordable for a few users to accommodate the full model. This leads
+users to adopt heterogeneous models based on their diverse computing
+capabilities and network bandwidth. Correspondingly, FL with heterogeneous
+models should be addressed, given that FL typically involves training a single
+global model. In this paper, we propose Generative Model-Aided Federated
+Learning (GeFL), incorporating a generative model that aggregates global
+knowledge across users of heterogeneous models. Our experiments on various
+classification tasks demonstrate notable performance improvements of GeFL
+compared to baselines, as well as limitations in terms of privacy and
+scalability. To tackle these concerns, we introduce a novel framework, GeFL-F.
+It trains target networks aided by feature-generative models. We empirically
+demonstrate the consistent performance gains of GeFL-F, while demonstrating
+better privacy preservation and robustness to a large number of clients. Code
+is available at [1].
+
+
+
+ comment: 20 pages
+
+
+
+
+
+
+ ☆ SoK: On the Offensive Potential of AI
+
+
+
+
+
+
+
+
+ Saskia Laura Schröer, Giovanni Apruzzese, Soheil Human, Pavel Laskov, Hyrum S. Anderson, Edward W. N. Bernroider, Aurore Fass, Ben Nassi, Vera Rimmer, Fabio Roli, Samer Salam, Ashley Shen, Ali Sunyaev, Tim Wadwha-Brown, Isabel Wagner, Gang Wang
+
+
+ Our society increasingly benefits from Artificial Intelligence (AI).
+Unfortunately, more and more evidence shows that AI is also used for offensive
+purposes. Prior works have revealed various examples of use cases in which the
+deployment of AI can lead to violation of security and privacy objectives. No
+extant work, however, has been able to draw a holistic picture of the offensive
+potential of AI. In this SoK paper we seek to lay the ground for a systematic
+analysis of the heterogeneous capabilities of offensive AI. In particular we
+(i) account for AI risks to both humans and systems while (ii) consolidating
+and distilling knowledge from academic literature, expert opinions, industrial
+venues, as well as laymen -- all of which are valuable sources of information
+on offensive AI.
+ To enable alignment of such diverse sources of knowledge, we devise a common
+set of criteria reflecting essential technological factors related to offensive
+AI. With the help of such criteria, we systematically analyze: 95 research
+papers; 38 InfoSec briefings (from, e.g., BlackHat); the responses of a user
+study (N=549) entailing individuals with diverse backgrounds and expertise; and
+the opinion of 12 experts. Our contributions not only reveal concerning ways
+(some of which were overlooked by prior work) in which AI can be offensively used
+today, but also represent a foothold to address this threat in the years to
+come.
+
+
+
+ comment: Systemization of Knowledge (SoK) paper
+
+
+
+
+
+
+ ☆ MixMAS: A Framework for Sampling-Based Mixer Architecture Search for
+ Multimodal Fusion and Learning
+
+
+ Choosing a suitable deep learning architecture for multimodal data fusion is
+a challenging task, as it requires the effective integration and processing of
+diverse data types, each with distinct structures and characteristics. In this
+paper, we introduce MixMAS, a novel framework for sampling-based mixer
+architecture search tailored to multimodal learning. Our approach automatically
+selects the optimal MLP-based architecture for a given multimodal machine
+learning (MML) task. Specifically, MixMAS utilizes a sampling-based
+micro-benchmarking strategy to explore various combinations of
+modality-specific encoders, fusion functions, and fusion networks,
+systematically identifying the architecture that best meets the task's
+performance metrics.
+
+
+
+
+
+
+
+ ☆ Gaussian entropic optimal transport: Schrödinger bridges and the
+ Sinkhorn algorithm
+
+
+
+
+
+
+
+
+ O. Deniz Akyildiz, Pierre Del Moral, Joaquín Miguez
+
+
+ Entropic optimal transport problems are regularized versions of optimal
+transport problems. These models play an increasingly important role in machine
+learning and generative modelling. For finite spaces, these problems are
+commonly solved using the Sinkhorn algorithm (a.k.a. the iterative proportional fitting
+procedure). However, in more general settings the Sinkhorn iterations are based
+on nonlinear conditional/conjugate transformations and exact finite-dimensional
+solutions cannot be computed. This article presents a finite-dimensional
+recursive formulation of the iterative proportional fitting procedure for
+general Gaussian multivariate models. As expected, this recursive formulation
+is closely related to the celebrated Kalman filter and related Riccati matrix
+difference equations, and it yields algorithms that can be implemented in
+practical settings without further approximations. We extend this filtering
+methodology to develop a refined and self-contained convergence analysis of
+Gaussian Sinkhorn algorithms, including closed form expressions of entropic
+transport maps and Schrödinger bridges.
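For the finite-space case mentioned above, the Sinkhorn (iterative proportional fitting) iteration alternately rescales the rows and columns of a Gibbs kernel until both marginals match. The following is a minimal sketch, assuming a small cost matrix, strictly positive marginals, and a regularization strength `eps` chosen only for illustration. It does not implement the Gaussian recursion developed in the paper.

```python
import math


def sinkhorn(cost, mu, nu, eps=0.5, iters=200):
    """Entropic optimal transport plan between finite marginals mu and nu."""
    n, m = len(mu), len(nu)
    # Gibbs kernel K_ij = exp(-c_ij / eps).
    kernel = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iters):
        # Rescale rows so the plan's row sums equal mu.
        u = [mu[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        # Rescale columns so the plan's column sums equal nu.
        v = [nu[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


cost = [[0.0, 1.0], [1.0, 0.0]]
mu, nu = [0.5, 0.5], [0.3, 0.7]
plan = sinkhorn(cost, mu, nu)
row_sums = [sum(row) for row in plan]
col_sums = [sum(plan[i][j] for i in range(2)) for j in range(2)]
```

After the final column update the column sums match `nu` up to rounding, while the row sums match `mu` only up to the convergence error of the iteration.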
+
+
+
+ comment: 68 pages
+
+
+
+
+
+
+ ☆ Discovery of 2D Materials via Symmetry-Constrained Diffusion Model
+
+
+ Generative models for 2D materials have shown significant promise in
+accelerating the material discovery process. The stability and performance of
+these materials are strongly influenced by their underlying symmetry. However,
+existing generative models for 2D materials often neglect symmetry constraints,
+which limits both the diversity and quality of the generated structures. Here,
+we introduce a symmetry-constrained diffusion model (SCDM) that integrates
+space group symmetry into the generative process. By incorporating Wyckoff
+positions, the model ensures adherence to symmetry principles, leading to the
+generation of 2,000 candidate structures. DFT calculations were conducted to
+evaluate the convex hull energies of these structures after structural
+relaxation. From the generated samples, 843 materials that met the energy
+stability criteria (Ehull < 0.6 eV/atom) were identified. Among these, six
+candidates were selected for further stability analysis, including phonon band
+structure evaluations and electronic properties investigations, all of which
+exhibited phonon spectrum stability. To benchmark the performance of SCDM, a
+symmetry-unconstrained diffusion model was also evaluated via a crystal structure
+prediction model. The results highlight that incorporating symmetry constraints
+enhances the effectiveness of generated 2D materials, making a contribution to
+the discovery of 2D materials through generative modeling.
+
+
+
+
+
+
+
+ ☆ A Statistical Framework for Ranking LLM-Based Chatbots
+
+
+
+
+
+
+
+
+ Siavash Ameli, Siyuan Zhuang, Ion Stoica, Michael W. Mahoney
+
+
+ Large language models (LLMs) have transformed natural language processing,
+with frameworks like Chatbot Arena providing pioneering platforms for
+evaluating these models. By facilitating millions of pairwise comparisons based
+on human judgments, Chatbot Arena has become a cornerstone in LLM evaluation,
+offering rich datasets for ranking models in open-ended conversational tasks.
+Building upon this foundation, we propose a statistical framework that
+incorporates key advancements to address specific challenges in pairwise
+comparison analysis. First, we introduce a factored tie model that enhances the
+ability to handle ties -- an integral aspect of human-judged comparisons --
+significantly improving the model's fit to observed data. Second, we extend the
+framework to model covariance between competitors, enabling deeper insights
+into performance relationships and facilitating intuitive groupings into
+performance tiers. Third, we resolve optimization challenges arising from
+parameter non-uniqueness by introducing novel constraints, ensuring stable and
+interpretable parameter estimation. Through rigorous evaluation and extensive
+experimentation, our framework demonstrates substantial improvements over
+existing methods in modeling pairwise comparison data. To support
+reproducibility and practical adoption, we release leaderbot, an open-source
+Python package implementing our models and analyses.
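As a point of reference for tie handling in pairwise comparisons, the sketch below implements the classical Davidson extension of the Bradley-Terry model. It is not the factored tie model proposed in the abstract and it does not use the leaderbot API; the function names, the log-strength parameterization, and the outcome encoding are assumptions made for illustration only.

```python
import math

def davidson_probs(score_i, score_j, tie_param):
    """Return (P(i wins), P(j wins), P(tie)) under the classical Davidson model.

    score_i, score_j: log-strengths of the two competitors (illustrative names).
    tie_param: non-negative tie propensity; 0 recovers plain Bradley-Terry.
    """
    pi_i, pi_j = math.exp(score_i), math.exp(score_j)
    tie_mass = tie_param * math.sqrt(pi_i * pi_j)
    total = pi_i + pi_j + tie_mass
    return pi_i / total, pi_j / total, tie_mass / total

def neg_log_likelihood(scores, tie_param, outcomes):
    """Negative log-likelihood of outcomes given as (i, j, result) triples.

    result is one of "i", "j", or "tie" (a hypothetical encoding).
    """
    nll = 0.0
    for i, j, result in outcomes:
        p_i, p_j, p_tie = davidson_probs(scores[i], scores[j], tie_param)
        nll -= math.log({"i": p_i, "j": p_j, "tie": p_tie}[result])
    return nll
```

Minimizing `neg_log_likelihood` over the scores and `tie_param` with any generic optimizer would give a baseline ranking. Note that adding a constant to every score leaves all probabilities unchanged, which is the kind of parameter non-uniqueness the abstract says it resolves with constraints.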
+
+
+ Recent vision-language foundation models still frequently produce outputs
+misaligned with their inputs, evidenced by object hallucination in captioning
+and prompt misalignment in the text-to-image generation model. Recent studies
+have explored methods for identifying misaligned elements, aiming not only to
+enhance interpretability but also to improve model performance. However,
+current approaches primarily rely on large foundation models in a zero-shot
+manner or fine-tuned models with human annotations, which limits scalability
+due to significant computational costs. This work proposes a novel approach,
+dubbed CLIP4DM, for detecting dense misalignments from pre-trained CLIP,
+specifically focusing on pinpointing misaligned words between image and text.
+We carefully revamp the gradient-based attribution computation method, enabling
+negative gradient of individual text tokens to indicate misalignment. We also
+propose F-CLIPScore, which aggregates misaligned attributions with a global
+alignment score. We evaluate our method on various dense misalignment detection
+benchmarks, covering various image and text domains and misalignment types. Our
+method demonstrates state-of-the-art performance among zero-shot models and
+competitive performance with fine-tuned models while maintaining superior
+efficiency. Our qualitative examples show that our method has a unique strength
+in detecting entity-level objects, intangible objects, and attributes that cannot
+be easily detected by existing works. We conduct ablation studies and analyses
+to highlight the strengths and limitations of our approach. Our code is
+publicly available at https://github.com/naver-ai/CLIP4DM.
+
+
+ Diffusion Probabilistic Models (DPMs) have emerged as the de facto approach
+for high-fidelity image synthesis, operating diffusion processes on continuous
+VAE latents, which differ significantly from the text generation methods
+employed by Large Language Models (LLMs). In this paper, we introduce a novel
+generative framework, the Recurrent Diffusion Probabilistic Model (RDPM), which
+enhances the diffusion process through a recurrent token prediction mechanism,
+thereby pioneering the field of Discrete Diffusion. By progressively
+introducing Gaussian noise into the latent representations of images and
+encoding them into vector-quantized tokens in a recurrent manner, RDPM
+facilitates a unique diffusion process on discrete-value domains. This process
+iteratively predicts the token codes for subsequent timesteps, transforming the
+initial standard Gaussian noise into the source data distribution, aligning
+with GPT-style models in terms of the loss function. RDPM demonstrates superior
+performance while benefiting from the speed advantage of requiring only a few
+inference steps. This model not only leverages the diffusion process to ensure
+high-quality generation but also converts continuous signals into a series of
+high-fidelity discrete tokens, thereby maintaining a unified optimization
+strategy with other discrete tokens, such as text. We anticipate that this work
+will contribute to the development of a unified model for multimodal
+generation, specifically by integrating continuous signal domains such as
+images, videos, and audio with text. We will release the code and model weights
+to the open-source community.
+
+
+
+ comment: 8 pages
+
+
+
+
+
+
+ ☆ Weak Scaling Capability in Token Space: An Observation from Large Vision
+ Language Model
+
+
+ The scaling capability has been widely validated with respect to the number
+of parameters and the size of training data. One important unexplored question
+is whether a similar scaling capability also exists with respect to the number
+of vision tokens. This study fills the gap by investigating the
+relationship between the number of vision tokens and the performance of
+vision-language models. Our theoretical analysis and empirical evaluations
+reveal that the model exhibits weak scaling capabilities on the length \(N_l\),
+with performance approximately \(S(N_l) \approx (c/N_l)^{\alpha}\), where \(c,
+\alpha\) are hyperparameters. Interestingly, this scaling behavior remains
+largely unaffected by the inclusion or exclusion of the user's question in the
+input. Furthermore, fusing the user's question with the vision token can
+enhance model performance when the question is relevant to the task. To address
+the computational challenges associated with large-scale vision tokens, we
+propose a novel architecture that efficiently reduces the token count while
+integrating user question tokens into the representation. Our findings may
+offer insights for developing more efficient and effective vision-language
+models under specific task constraints.
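The power-law form \(S(N_l) \approx (c/N_l)^{\alpha}\) quoted above becomes linear in log-log space, since \(\log S = \alpha \log c - \alpha \log N_l\). The sketch below shows how \(c\) and \(\alpha\) could be recovered from measurements by ordinary least squares. The function names and the synthetic data are assumptions for illustration, and the abstract does not say how the authors estimate these quantities.

```python
import math

def scaling_curve(n_tokens, c, alpha):
    """Evaluate S(N_l) = (c / N_l) ** alpha for a given vision-token count."""
    return (c / n_tokens) ** alpha

def fit_scaling(token_counts, scores):
    """Least-squares fit of log S = alpha * log c - alpha * log N_l.

    Returns (c, alpha). Assumes positive inputs and a non-zero slope.
    """
    xs = [math.log(n) for n in token_counts]
    ys = [math.log(s) for s in scores]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    alpha = -slope                     # slope of the log-log line is -alpha
    c = math.exp(intercept / alpha)    # intercept equals alpha * log c
    return c, alpha

# Synthetic, noise-free measurements generated from known (hypothetical) values.
token_counts = [16, 32, 64, 128, 256, 576]
scores = [scaling_curve(n, c=4.0, alpha=0.25) for n in token_counts]
c_hat, alpha_hat = fit_scaling(token_counts, scores)
```

On real, noisy measurements the fit would only be approximate, and a small fitted \(\alpha\) would correspond to the "weak" scaling described in the abstract.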
+
+
+
+
+
+
+
+ ☆ ChaI-TeA: A Benchmark for Evaluating Autocompletion of Interactions with
+ LLM-based Chatbots
+
+
+ The rise of LLMs has deflected a growing portion of human-computer
+interactions towards LLM-based chatbots. The remarkable abilities of these
+models allow users to interact using long, diverse natural language text
+covering a wide range of topics and styles. Phrasing these messages takes
+considerable time and effort, calling for an autocomplete solution to assist
+users. We introduce the task of chatbot interaction autocomplete. We present
+ChaI-TeA: CHat InTEraction Autocomplete, an autocomplete evaluation framework
+for LLM-based chatbot interactions. The framework includes a formal definition
+of the task, coupled with suitable datasets and metrics. We use the framework
+to test 9 models on the defined autocompletion task, finding that
+while current off-the-shelf models perform fairly, there is still much room for
+improvement, mainly in ranking of the generated suggestions. We provide
+insights for practitioners working on this task and open new research
+directions for researchers in the field. We release our framework to serve as a
+foundation for future research.
+
+
+
+
+
+
+
+ ☆ Unveiling the Threat of Fraud Gangs to Graph Neural Networks:
+ Multi-Target Graph Injection Attacks against GNN-Based Fraud Detectors AAAI
+
+
+ Graph neural networks (GNNs) have emerged as an effective tool for fraud
+detection, identifying fraudulent users, and uncovering malicious behaviors.
+However, attacks against GNN-based fraud detectors and their risks have rarely
+been studied, thereby leaving potential threats unaddressed. Recent findings
+suggest that frauds are increasingly organized as gangs or groups. In this
+work, we design attack scenarios where fraud gangs aim to make their fraud
+nodes misclassified as benign by camouflaging their illicit activities in
+collusion. Based on these scenarios, we study adversarial attacks against
+GNN-based fraud detectors by simulating attacks of fraud gangs in three
+real-world fraud cases: spam reviews, fake news, and medical insurance frauds.
+We define these attacks as multi-target graph injection attacks and propose
+MonTi, a transformer-based Multi-target one-Time graph injection attack model.
+MonTi simultaneously generates attributes and edges of all attack nodes with a
+transformer encoder, capturing interdependencies between attributes and edges
+more effectively than most existing graph injection attack methods that
+generate these elements sequentially. Additionally, MonTi adaptively allocates
+the degree budget for each attack node to explore diverse injection structures
+involving target, candidate, and attack nodes, unlike existing methods that fix
+the degree budget across all attack nodes. Experiments show that MonTi
+outperforms the state-of-the-art graph injection attack methods on five
+real-world graphs.
+
+
+
+ comment: 19 pages, 5 figures, 12 tables, The 39th AAAI Conference on
+ Artificial Intelligence (AAAI 2025)
+
+
+
+
+
+
+ ☆ Hypergraph Attacks via Injecting Homogeneous Nodes into Elite Hyperedges AAAI
+
+
+ Recent studies have shown that Hypergraph Neural Networks (HGNNs) are
+vulnerable to adversarial attacks. Existing approaches focus on hypergraph
+modification attacks guided by gradients, overlooking node spanning in the
+hypergraph and the group identity of hyperedges, thereby resulting in limited
+attack performance and detectable attacks. In this manuscript, we present a
+novel framework, i.e., Hypergraph Attacks via Injecting Homogeneous Nodes into
+Elite Hyperedges (IE-Attack), to tackle these challenges. Initially, utilizing
+the node spanning in the hypergraph, we propose the elite hyperedges sampler to
+identify hyperedges to be injected. Subsequently, a node generator utilizing
+Kernel Density Estimation (KDE) is proposed to generate the homogeneous node
+with the group identity of hyperedges. Finally, by injecting the homogeneous
+node into elite hyperedges, IE-Attack improves the attack performance and
+enhances the imperceptibility of attacks. Extensive experiments are conducted
+on five authentic datasets to validate the effectiveness of IE-Attack and the
+corresponding superiority to state-of-the-art methods.
+
+
+
+ comment: 9 pages, The 39th Annual AAAI Conference on Artificial
+ Intelligence (2025)
+
+
+
+
+
+
+ ☆ Point-DeepONet: A Deep Operator Network Integrating PointNet for
+ Nonlinear Analysis of Non-Parametric 3D Geometries and Load Conditions
+
+
+ Nonlinear structural analyses in engineering often require extensive finite
+element simulations, limiting their applicability in design optimization,
+uncertainty quantification, and real-time control. Conventional deep learning
+surrogates, such as convolutional neural networks (CNNs), physics-informed
+neural networks (PINNs), and Fourier neural operators (FNOs), face challenges
+with complex non-parametric three-dimensional (3D) geometries, directionally
+varying loads, and high-fidelity predictions on unstructured meshes. This work
+presents Point-DeepONet, an operator-learning-based surrogate that integrates
+PointNet into the DeepONet framework. By directly processing non-parametric
+point clouds and incorporating signed distance functions (SDF) for geometric
+context, Point-DeepONet accurately predicts three-dimensional displacement and
+von Mises stress fields without mesh parameterization or retraining. Trained
+using only about 5,000 nodes (2.5% of the original 200,000-node mesh),
+Point-DeepONet can still predict the entire mesh at high fidelity, achieving a
+coefficient of determination reaching 0.987 for displacement and 0.923 for von
+Mises stress under a horizontal load case. Compared to nonlinear finite element
+analyses that require about 19.32 minutes per case, Point-DeepONet provides
+predictions in mere seconds, approximately 400 times faster, while maintaining
+excellent scalability and accuracy with increasing dataset sizes. These
+findings highlight the potential of Point-DeepONet to enable rapid,
+high-fidelity structural analyses, ultimately supporting more effective design
+exploration and informed decision-making in complex engineering workflows.
+
+
+
+ comment: 23 pages, 16 figures, and 5 tables
+
+
+
+
+
+
+ ☆ Addressing Spatial-Temporal Data Heterogeneity in Federated Continual
+ Learning via Tail Anchor
+
+
+ Federated continual learning (FCL) allows each client to continually update
+its knowledge from task streams, enhancing the applicability of federated
+learning in real-world scenarios. However, FCL needs to address not only
+spatial data heterogeneity between clients but also temporal data heterogeneity
+between tasks. In this paper, empirical experiments demonstrate that such
+input-level heterogeneity significantly affects the model's internal parameters
+and outputs, leading to severe spatial-temporal catastrophic forgetting of
+local and previous knowledge. To this end, we propose Federated Tail Anchor
+(FedTA) to mix trainable Tail Anchor with the frozen output features to adjust
+their position in the feature space, thereby overcoming parameter-forgetting
+and output-forgetting. Moreover, three novel components are also included in
+FedTA: Input Enhancement for improving the performance of pre-trained models on
+downstream tasks; Selective Input Knowledge Fusion for fusion of heterogeneous
+local knowledge on the server side; and Best Global Prototype Selection for
+finding the best anchor point for each class in the feature space. Extensive
+experiments demonstrate that FedTA not only outperforms existing FCL methods
+but also effectively preserves the relative positions of features, remaining
+unaffected by spatial and temporal changes.
+
+
+
+
+
+
+
+ ☆ Predator Prey Scavenger Model using Holling's Functional Response of
+ Type III and Physics-Informed Deep Neural Networks
+
+
+ Nonlinear mathematical models describe the relations among various physical
+and biological interactions present in nature. One of the most famous models is
+the Lotka-Volterra model, which defines the interaction between predator and
+prey species present in nature. However, predators, scavengers, and prey
+populations coexist in a natural system where scavengers can additionally rely
+on the dead bodies of predators present in the system. Keeping this in mind,
+the formulation and simulation of the predator prey scavenger model is
+introduced in this paper. For the predation response, respective prey species
+are assumed to have Holling's functional response of type III. The proposed
+model is tested in various simulations and shows satisfactory
+results in different scenarios. After the simulations, the American forest dataset
+is used for parameter estimation, imitating a real-world case. For
+parameter estimation, a physics-informed deep neural network is used with the
+Adam optimization method, which prevents the avalanche effect in updates of the
+trainable parameters. For the neural network, the mean squared error and a
+physics-informed error are considered. After the neural network is trained, the
+estimated parameters are fine-tuned using the
+Broyden-Fletcher-Goldfarb-Shanno algorithm. Finally, the parameters estimated
+from the natural dataset are tested for stability using Jacobian stability
+analysis. Future research work includes minimization of error induced by
+parameters, bifurcation analysis, and sensitivity analysis of the parameters.
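Holling's type III functional response has a standard sigmoidal form, \(f(x) = a x^2 / (h^2 + x^2)\), where \(a\) is the maximum consumption rate and \(h\) the half-saturation prey density. The sketch below evaluates it and plugs it into a three-species right-hand side. The system of equations and every parameter name are hypothetical, chosen only to illustrate the structure the abstract describes (scavengers feeding on dead predators); they are not the paper's model.

```python
def holling_type_iii(prey, max_rate, half_saturation):
    """Sigmoidal per-predator consumption rate: a * x^2 / (h^2 + x^2)."""
    return max_rate * prey**2 / (half_saturation**2 + prey**2)

def rhs(state, params):
    """Time derivatives of a hypothetical predator-prey-scavenger system.

    state: (prey, predator, scavenger) densities.
    params: dict of illustrative rate constants, not taken from the paper.
    """
    prey, predator, scavenger = state
    intake = holling_type_iii(prey, params["max_rate"], params["half_saturation"])
    predator_deaths = params["predator_mortality"] * predator
    # Logistic prey growth minus type III predation.
    d_prey = params["growth"] * prey * (1 - prey / params["capacity"]) - intake * predator
    # Predators convert consumed prey into growth and die at a constant rate.
    d_predator = params["conversion"] * intake * predator - predator_deaths
    # Scavengers grow in proportion to available predator carcasses.
    d_scavenger = (
        params["scavenging"] * predator_deaths - params["scavenger_mortality"]
    ) * scavenger
    return d_prey, d_predator, d_scavenger
```

A right-hand side of this shape can be handed to any ODE integrator for simulation, and the same residual could serve as the physics-informed error term when fitting parameters to data.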
+
+
+
+
+
+
+
+
+ Kunyu Peng, Di Wen, Sarfraz M. Saquib, Yufan Chen, Junwei Zheng, David Schneider, Kailun Yang, Jiamin Wu, Alina Roitberg, Rainer Stiefelhagen
+
+
+ Open-Set Domain Generalization (OSDG) is a challenging task requiring models
+to accurately predict familiar categories while minimizing confidence for
+unknown categories to effectively reject them in unseen domains. While the OSDG
+field has seen considerable advancements, the impact of label noise--a common
+issue in real-world datasets--has been largely overlooked. Label noise can
+mislead model optimization, thereby exacerbating the challenges of open-set
+recognition in novel domains. In this study, we take the first step towards
+addressing Open-Set Domain Generalization under Noisy Labels (OSDG-NL) by
+constructing dedicated benchmarks derived from widely used OSDG datasets,
+including PACS and DigitsDG. We evaluate baseline approaches by integrating
+techniques from both label denoising and OSDG methodologies, highlighting the
+limitations of existing strategies in handling label noise effectively. To
+address these limitations, we propose HyProMeta, a novel framework that
+integrates hyperbolic category prototypes for label noise-aware meta-learning
+alongside a learnable new-category agnostic prompt designed to enhance
+generalization to unseen classes. Our extensive experiments demonstrate the
+superior performance of HyProMeta compared to state-of-the-art methods across
+the newly established benchmarks. The source code of this work is released at
+https://github.com/KPeng9510/HyProMeta.
+
+
+
+ comment: The source code of this work is released at
+ https://github.com/KPeng9510/HyProMeta
+
+
+
+
+
+
+ ☆ Exploring Graph Mamba: A Comprehensive Survey on State-Space Models for
+ Graph Learning
+
+
+
+
+
+
+
+
+ Safa Ben Atitallah, Chaima Ben Rabah, Maha Driss, Wadii Boulila, Anis Koubaa
+
+
+ Graph Mamba, a powerful graph embedding technique, has emerged as a
+cornerstone in various domains, including bioinformatics, social networks, and
+recommendation systems. This survey represents the first comprehensive study
+devoted to Graph Mamba, to address the critical gaps in understanding its
+applications, challenges, and future potential. We start by offering a detailed
+explanation of the original Graph Mamba architecture, highlighting its key
+components and underlying mechanisms. Subsequently, we explore the most recent
+modifications and enhancements proposed to improve its performance and
+applicability. To demonstrate the versatility of Graph Mamba, we examine its
+applications across diverse domains. A comparative analysis of Graph Mamba and
+its variants is conducted to shed light on their unique characteristics and
+potential use cases. Furthermore, we identify potential areas where Graph Mamba
+can be applied in the future, highlighting its potential to revolutionize data
+analysis in these fields. Finally, we address the current limitations and open
+research questions associated with Graph Mamba. By acknowledging these
+challenges, we aim to stimulate further research and development in this
+promising area. This survey serves as a valuable resource for both newcomers
+and experienced researchers seeking to understand and leverage the power of
+Graph Mamba.
+
+
+
+
+
+
+
+
+ Ahmed E. Samy, Zekarias T. Kefatoa, Sarunas Girdzijauskasa
+
+
+ Self-supervised graph representation learning (SSGRL) is a representation
+learning paradigm used to reduce or avoid manual labeling. An essential part of
+SSGRL is graph data augmentation. Existing methods usually rely on heuristics
+commonly identified through trial and error and are effective only within some
+application domains. Also, it is not clear why one heuristic is better than
+another. Moreover, recent studies have argued against some techniques (e.g.,
+dropout, which can change the properties of molecular graphs or destroy relevant
+signals for graph-based document classification tasks).
+ In this study, we propose a novel data-driven SSGRL approach that
+automatically learns a suitable graph augmentation from the signal encoded in
+the graph (i.e., the nodes' predictive feature and topological information). We
+propose two complementary approaches that produce learnable feature and
+topological augmentations. The former learns multi-view augmentation of node
+features, and the latter learns a high-order view of the topology. Moreover,
+the augmentations are jointly learned with the representation. Our approach is
+general in that it can be applied to homogeneous and heterogeneous graphs. We
+perform extensive experiments on node classification (using nine homogeneous
+and heterogeneous datasets) and graph property prediction (using another eight
+datasets). The results show that the proposed method matches or outperforms the
+SOTA SSGRL baselines and performs similarly to semi-supervised methods. The
+anonymised source code is available at https://github.com/AhmedESamy/dsgrl/
+
+
+
+
+
+
+
+
+ Jaechul Roh, Andrew Yuan, Jinsong Mao
+
+
+ Text-to-Image (T2I) diffusion models have rapidly advanced, enabling the
+generation of high-quality images that align closely with textual descriptions.
+However, this progress has also raised concerns about their misuse for
+propaganda and other malicious activities. Recent studies reveal that attackers
+can embed biases into these models through simple fine-tuning, causing them to
+generate targeted imagery when triggered by specific phrases. This underscores
+the potential for T2I models to act as tools for disseminating propaganda,
+producing images aligned with an attacker's objective for end-users.
+ Building on this concept, we introduce FameBias, a T2I biasing attack that
+manipulates the embeddings of input prompts to generate images featuring
+specific public figures. Unlike prior methods, FameBias operates solely on the
+input embedding vectors without requiring additional model training. We
+evaluate FameBias comprehensively using Stable Diffusion V2, generating a large
+corpus of images based on various trigger nouns and target public figures. Our
+experiments demonstrate that FameBias achieves a high attack success rate while
+preserving the semantic context of the original prompts across multiple
+trigger-target pairs.
+
+
+
+
+
+
+
+ ☆ Quo Vadis, Anomaly Detection? LLMs and VLMs in the Spotlight
+
+
+ Video anomaly detection (VAD) has witnessed significant advancements through
+the integration of large language models (LLMs) and vision-language models
+(VLMs), addressing critical challenges such as interpretability, temporal
+reasoning, and generalization in dynamic, open-world scenarios. This paper
+presents an in-depth review of cutting-edge LLM-/VLM-based methods in 2024,
+focusing on four key aspects: (i) enhancing interpretability through semantic
+insights and textual explanations, making visual anomalies more understandable;
+(ii) capturing intricate temporal relationships to detect and localize dynamic
+anomalies across video frames; (iii) enabling few-shot and zero-shot detection
+to minimize reliance on large, annotated datasets; and (iv) addressing
+open-world and class-agnostic anomalies by using semantic understanding and
+motion features for spatiotemporal coherence. We highlight their potential to
+redefine the landscape of VAD. Additionally, we explore the synergy between
+visual and textual modalities offered by LLMs and VLMs, highlighting their
+combined strengths and proposing future directions to fully exploit the
+potential in enhancing video anomaly detection.
+
+
+
+ comment: Research report
+
+
+
+
+
+
+ ☆ Learning to Play Against Unknown Opponents
+
+
+ We consider the problem of a learning agent who has to repeatedly play a
+general sum game against a strategic opponent who acts to maximize their own
+payoff by optimally responding against the learner's algorithm. The learning
+agent knows their own payoff function, but is uncertain about the payoff of
+their opponent (knowing only that it is drawn from some distribution
+$\mathcal{D}$). What learning algorithm should the agent run in order to
+maximize their own total utility?
+ We demonstrate how to construct an $\varepsilon$-optimal learning algorithm
+(obtaining average utility within $\varepsilon$ of the optimal utility) for
+this problem in time polynomial in the size of the input and $1/\varepsilon$
+when either the size of the game or the support of $\mathcal{D}$ is constant.
+When the learning algorithm is further constrained to be a no-regret algorithm,
+we demonstrate how to efficiently construct an optimal learning algorithm
+(asymptotically achieving the optimal utility) in polynomial time, independent
+of any other assumptions. Both results make use of recently developed machinery
+that converts the analysis of learning algorithms to the study of the class of
+corresponding geometric objects known as menus.
+
+
+
+
+
+
+
+ ☆ Navigating Data Corruption in Machine Learning: Balancing Quality,
+ Quantity, and Imputation Strategies
+
+
+ Data corruption, including missing and noisy data, poses significant
+challenges in real-world machine learning. This study investigates the effects
+of data corruption on model performance and explores strategies to mitigate
+these effects through two experimental setups: supervised learning with NLP
+tasks (NLP-SL) and deep reinforcement learning for traffic signal optimization
+(Signal-RL). We analyze the relationship between data corruption levels and
+model performance, evaluate the effectiveness of data imputation methods, and
+assess the utility of enlarging datasets to address data corruption.
+ Our results show that model performance under data corruption follows a
+diminishing-returns curve, modeled by an exponential function. Missing data,
+while detrimental, is less harmful than noisy data, which causes severe
+performance degradation and training instability, particularly in sequential
+decision-making tasks like Signal-RL. Imputation strategies involve a
+trade-off: they recover missing information but may introduce noise. Their
+effectiveness depends on imputation accuracy and corruption ratio. We identify
+distinct regions in the imputation advantage heatmap, including an "imputation
+advantageous corner" and an "imputation disadvantageous edge", and classify
+tasks as "noise-sensitive" or "noise-insensitive" based on their decision
+boundaries.
+ Furthermore, we find that increasing dataset size mitigates but cannot fully
+overcome the effects of data corruption. The marginal utility of additional
+data diminishes as corruption increases. An empirical rule emerges:
+approximately 30% of the data is critical for determining performance, while
+the remaining 70% has minimal impact.
+ These findings provide actionable insights into data preprocessing,
+imputation strategies, and data collection practices, guiding the development
+of robust machine learning systems in noisy environments.
+
+
+
+
+
+
+
+ ☆ DeepCRCEval: Revisiting the Evaluation of Code Review Comment Generation
+
+
+
+
+
+
+
+
+ Junyi Lu, Xiaojia Li, Zihan Hua, Lei Yu, Shiqi Cheng, Li Yang, Fengjun Zhang, Chun Zuo
+
+
+ Code review is a vital but demanding aspect of software development,
+generating significant interest in automating review comments. Traditional
+evaluation methods for these comments, primarily based on text similarity, face
+two major challenges: inconsistent reliability of human-authored comments in
+open-source projects and the weak correlation of text similarity with
+objectives like enhancing code quality and detecting defects.
+ This study empirically analyzes benchmark comments using a novel set of
+criteria informed by prior research and developer interviews. We then similarly
+revisit the evaluation of existing methodologies. Our evaluation framework,
+DeepCRCEval, integrates human evaluators and Large Language Models (LLMs) for a
+comprehensive reassessment of current techniques based on the criteria set.
+We also introduce an innovative and efficient baseline, LLM-Reviewer,
+leveraging the few-shot learning capabilities of LLMs for a target-oriented
+comparison.
+ Our research highlights the limitations of text similarity metrics, finding
+that less than 10% of benchmark comments are high quality for automation. In
+contrast, DeepCRCEval effectively distinguishes between high and low-quality
+comments, proving to be a more reliable evaluation mechanism. Incorporating LLM
+evaluators into DeepCRCEval significantly boosts efficiency, reducing time and
+cost by 88.78% and 90.32%, respectively. Furthermore, LLM-Reviewer demonstrates
+significant potential for focusing on the task's real targets in comment generation.
+
+
+
+ comment: Accepted to the 28th International Conference on Fundamental
+ Approaches to Software Engineering (FASE 2025), part of the 28th European
+ Joint Conferences on Theory and Practice of Software (ETAPS 2025)
+
+
+
+
+
+
+ ☆ Dissipation alters modes of information encoding in small quantum
+ reservoirs near criticality
+
+
+ Quantum reservoir computing (QRC) has emerged as a promising paradigm for
+harnessing near-term quantum devices to tackle temporal machine learning tasks.
+Yet identifying the mechanisms that underlie enhanced performance remains
+challenging, particularly in many-body open systems where nonlinear
+interactions and dissipation intertwine in complex ways. Here, we investigate a
+minimal model of a driven-dissipative quantum reservoir described by two
+coupled Kerr-nonlinear oscillators, an experimentally realizable platform that
+features controllable coupling, intrinsic nonlinearity, and tunable photon
+loss. Using Partial Information Decomposition (PID), we examine how different
+dynamical regimes encode input drive signals in terms of redundancy
+(information shared by each oscillator) and synergy (information accessible
+only through their joint observation). Our key results show that, near a
+critical point marking a dynamical bifurcation, the system transitions from
+predominantly redundant to synergistic encoding. We further demonstrate that
+synergy amplifies short-term responsiveness, thereby enhancing immediate memory
+retention, whereas strong dissipation leads to more redundant encoding that
+supports long-term memory retention. These findings elucidate how the interplay
+of instability and dissipation shapes information processing in small quantum
+systems, providing a fine-grained, information-theoretic perspective for
+analyzing and designing QRC platforms.
+
+
+
+ comment: 30 pages, 12 figures
+
+
+
+
+
+
+ ☆ Towards understanding how attention mechanism works in deep learning
+
+
+ The attention mechanism has been extensively integrated within mainstream neural
+network architectures, such as Transformers and graph attention networks. Yet,
+its underlying working principles remain somewhat elusive. What is its essence?
+Are there any connections between it and traditional machine learning
+algorithms? In this study, we inspect the process of computing similarity using
+classic metrics and vector space properties in manifold learning, clustering,
+and supervised learning. We identify the key characteristics of similarity
+computation and information propagation in these methods and demonstrate that
+the self-attention mechanism in deep learning adheres to the same principles
+but operates more flexibly and adaptively. We decompose the self-attention
+mechanism into a learnable pseudo-metric function and an information
+propagation process based on similarity computation. We prove that the
+self-attention mechanism converges to a drift-diffusion process through
+continuous modeling provided the pseudo-metric is a transformation of a metric
+and certain reasonable assumptions hold. This equation could be transformed
+into a heat equation under a new metric. In addition, we give a first-order
+analysis of the attention mechanism with a general pseudo-metric function. This
+study aids in understanding the effects and principles of the attention mechanism
+through physical intuition. Finally, we propose a modified attention mechanism
+called metric-attention by leveraging the concept of metric learning to
+facilitate the ability to learn desired metrics more effectively. Experimental
+results demonstrate that it outperforms self-attention regarding training
+efficiency, accuracy, and robustness.
+
+
+ Credit card fraud incurs a considerable cost for both cardholders and issuing
+banks. Contemporary methods apply machine learning-based classifiers to detect
+fraudulent behavior from labeled transaction records. However, labeled data
+usually make up a small proportion of billions of real transactions due to expensive
+labeling costs, which implies that these methods do not fully exploit many natural
+features of unlabeled data. Therefore, we propose a semi-supervised graph
+neural network for fraud detection. Specifically, we leverage transaction
+records to construct a temporal transaction graph, which is composed of
+temporal transactions (nodes) and interactions (edges) among them. Then we pass
+messages among the nodes through a Gated Temporal Attention Network (GTAN) to
+learn the transaction representation. We further model the fraud patterns
+through risk propagation among transactions. Extensive experiments are
+conducted on a real-world transaction dataset and two publicly available fraud
+detection datasets. The results show that our proposed method, namely GTAN,
+outperforms other state-of-the-art baselines on three fraud detection datasets.
+Semi-supervised experiments demonstrate the excellent fraud detection
+performance of our model with only a tiny proportion of labeled data.
+
+
+ We define the local complexity of a neural network with continuous piecewise
+linear activations as a measure of the density of linear regions over an input
+data distribution. We show theoretically that ReLU networks that learn
+low-dimensional feature representations have a lower local complexity. This
+allows us to connect recent empirical observations on feature learning at the
+level of the weight matrices with concrete properties of the learned functions.
+In particular, we show that the local complexity serves as an upper bound on
+the total variation of the function over the input data distribution and thus
+that feature learning can be related to adversarial robustness. Lastly, we
+consider how optimization drives ReLU networks towards solutions with lower
+local complexity. Overall, this work contributes a theoretical framework
+towards relating geometric properties of ReLU networks to different aspects of
+learning such as feature learning and representation cost.
+
+
+
+
+
+
+
+ ☆ GDM4MMIMO: Generative Diffusion Models for Massive MIMO Communications
+
+
+ Massive multiple-input multiple-output (MIMO) offers significant advantages
+in spectral and energy efficiencies, positioning it as a cornerstone technology
+of fifth-generation (5G) wireless communication systems and a promising
+solution for the burgeoning data demands anticipated in sixth-generation (6G)
+networks. In recent years, with the continuous advancement of artificial
+intelligence (AI), a multitude of task-oriented generative foundation models
+(GFMs) have emerged, achieving remarkable performance in various fields such as
+computer vision (CV), natural language processing (NLP), and autonomous
+driving. As a pioneering force, these models are driving the paradigm shift in
+AI towards generative AI (GenAI). Among them, the generative diffusion model
+(GDM), as one of the state-of-the-art families of generative models, demonstrates
+an exceptional capability to learn implicit prior knowledge and robust
+generalization capabilities, thereby enhancing its versatility and
+effectiveness across diverse applications. In this paper, we delve into the
+potential applications of GDM in massive MIMO communications. Specifically, we
+first provide an overview of massive MIMO communication, the framework of GFMs,
+and the working mechanism of GDM. Following this, we discuss recent research
+advancements in the field and present a case study of near-field channel
+estimation based on GDM, demonstrating its promising potential for facilitating
+efficient ultra-dimensional channel state information (CSI) acquisition in
+the context of massive MIMO communications. Finally, we highlight several
+pressing challenges in future mobile communications and identify promising
+research directions surrounding GDM.
+
+
+
+ comment: 6 pages, 3 figures
+
+
+
+
+
+
+ ☆ Towards Modality Generalization: A Benchmark and Prospective Analysis
+
+
+ Multi-modal learning has achieved remarkable success by integrating
+information from various modalities, achieving superior performance in tasks
+like recognition and retrieval compared to uni-modal approaches. However,
+real-world scenarios often present novel modalities that are unseen during
+training due to resource and privacy constraints, a challenge current methods
+struggle to address. This paper introduces Modality Generalization (MG), which
+focuses on enabling models to generalize to unseen modalities. We define two
+cases: weak MG, where both seen and unseen modalities can be mapped into a
+joint embedding space via existing perceptors, and strong MG, where no such
+mappings exist. To facilitate progress, we propose a comprehensive benchmark
+featuring multi-modal algorithms and adapt existing methods that focus on
+generalization. Extensive experiments highlight the complexity of MG, exposing
+the limitations of existing methods and identifying key directions for future
+research. Our work provides a foundation for advancing robust and adaptable
+multi-modal models, enabling them to handle unseen modalities in realistic
+scenarios.
+
+
+ Real-world graph data intrinsically contain noise (e.g., link and
+structure errors) that inevitably degrades the effectiveness of graph
+representation and downstream learning tasks. For homogeneous graphs, the
+latest works use original node features to synthesize a similarity graph that
+can correct the structure of the noised graph. This idea is based on the
+homogeneity assumption, which states that similar nodes in the homogeneous
+graph tend to have direct links in the original graph. However, similar nodes
+in heterogeneous graphs usually do not have direct links, so a similarity graph cannot be used
+to correct the original noisy graph. This poses a significant challenge in
+noised heterogeneous graph learning. To this end, this paper proposes a novel
+synthesized similarity-based graph neural network compatible with noised
+heterogeneous graph learning. First, we calculate the original feature
+similarities of all nodes to synthesize a similarity-based high-order graph.
+Second, we propose a similarity-aware encoder to embed original and synthesized
+graphs with shared parameters. Then, instead of graph-to-graph supervising, we
+synchronously supervise the original and synthesized graph embeddings to
+predict the same labels. Meanwhile, a target-based graph extracted from the
+synthesized graph contrasts the structure of the metapath-based graph extracted
+from the original graph to learn the mutual information. Extensive experiments
+in numerous real-world datasets show the proposed method achieves
+state-of-the-art results in noised heterogeneous graph learning tasks. In
+particular, improvements of 5$\sim$6\% are observed on several noised datasets
+compared with previous SOTA methods. The code and datasets are available at
+https://github.com/kg-cc/NoiseHGNN.
+
+
+
+ comment: AAAI2025
+
+
+
+
+
+
+ ☆ Free the Design Space of Equivariant Graph Neural Networks: High-Rank
+ Irreducible Cartesian Tensor Decomposition and Bases of Equivariant Spaces
+
+
+ Irreducible Cartesian tensors (ICTs) play a crucial role in the design of
+equivariant graph neural networks, as well as in theoretical chemistry and
+chemical physics. Meanwhile, the design space of available linear operations on
+tensors that preserve symmetry presents a significant challenge. The ICT
+decomposition and a basis of this equivariant space are difficult to obtain for
+high-order tensors. After decades of research, an explicit ICT decomposition
+for $n=5$ was only recently achieved \citep{bonvicini2024irreducible}, with factorial
+time/space complexity. This work, for the first time, obtains decomposition
+matrices for ICTs up to rank $n=9$ with reduced and affordable complexity, by
+constructing what we call path matrices. The path matrices are obtained via
+performing chain-like contraction with Clebsch-Gordan matrices following the
+parentage scheme. We prove and leverage that the concatenation of path matrices
+is an orthonormal change-of-basis matrix between the Cartesian tensor product
+space and the spherical direct sum spaces. Furthermore, we identify a complete
+orthogonal basis for the equivariant space, rather than a spanning set
+\citep{pearce2023brauer}, through this path matrices technique. We further
+extend our result to the arbitrary tensor product and direct sum spaces,
+enabling free design between different spaces while keeping symmetry. The
+Python code is available in the appendix where the $n=6,\dots,9$ ICT
+decomposition matrices are obtained in <0.1s, 0.5s, 1s, 3s, 11s, and 4m32s,
+respectively.
+
+
+ Recent work revealed a tight connection between adversarial robustness and
+restricted forms of symbolic explanations, namely distance-based (formal)
+explanations. This connection is significant because it represents a first step
+towards making the computation of symbolic explanations as efficient as
+deciding the existence of adversarial examples, especially for highly complex
+machine learning (ML) models. However, a major performance bottleneck remains,
+because of the very large number of features that ML models may possess, in
+particular for deep neural networks. This paper proposes novel algorithms to
+compute the so-called contrastive explanations for ML models with a large
+number of features, by leveraging on adversarial robustness. Furthermore, the
+paper also proposes novel algorithms for listing explanations and finding
+smallest contrastive explanations. The experimental results demonstrate the
+performance gains achieved by the novel algorithms proposed in this paper.
+
+
+
+ comment: arXiv admin note: substantial text overlap with arXiv:2405.08297
+
+
+
+
+
+
+ ☆ Robust Semi-Supervised Learning in Open Environments
+
+
+ Semi-supervised learning (SSL) aims to improve performance by exploiting
+unlabeled data when labels are scarce. Conventional SSL studies typically
+assume closed environments where important factors (e.g., label, feature,
+distribution) between labeled and unlabeled data are consistent. However, more
+practical tasks involve open environments where important factors between
+labeled and unlabeled data are inconsistent. It has been reported that
+exploiting inconsistent unlabeled data causes severe performance degradation,
+even worse than the simple supervised learning baseline. Manually verifying the
+quality of unlabeled data is not desirable; therefore, it is important to study
+robust SSL with inconsistent unlabeled data in open environments. This paper
+briefly introduces some advances in this line of research, focusing on
+techniques concerning label, feature, and data distribution inconsistency in
+SSL, and presents the evaluation benchmarks. Open research problems are also
+discussed for reference purposes.
+
+
+
+ comment: 12 pages, 4 figures
+
+
+
+
+
+
+ ☆ Detection and Forecasting of Parkinson Disease Progression from Speech
+ Signal Features Using MultiLayer Perceptron and LSTM
+
+
+
+
+
+
+
+
+ Majid Ali, Hina Shakir, Asia Samreen, Sohaib Ahmed
+
+
+ Accurate diagnosis of Parkinson disease, especially in its early stages, can
+be a challenging task. The application of machine learning techniques helps
+improve the diagnostic accuracy of Parkinson disease detection, but only a few
+studies have presented work towards the prediction of disease progression. In
+this research work, a Long Short-Term Memory (LSTM) network was trained using the
+diagnostic features of Parkinson patients' speech signals to predict the
+disease progression, while a Multilayer Perceptron (MLP) was trained on the same
+diagnostic features to detect the disease. Diagnostic features selected using
+two well-known feature selection methods, Relief-F and Sequential Forward
+Selection, and applied to LSTM and MLP have been shown to accurately predict the
+disease progression as stage 2 and 3 and its existence, respectively.
+
+
+
+
+
+
+
+ ☆ Fréchet regression for multi-label feature selection with implicit
+ regularization
+
+
+ Fr\'echet regression extends linear regression to model complex responses
+in metric spaces, making it particularly relevant for multi-label regression,
+where each instance can have multiple associated labels. However, variable
+selection within this framework remains underexplored. In this paper, we
+propose a novel variable selection method that employs implicit regularization
+instead of traditional explicit regularization approaches, which can introduce
+bias. Our method effectively captures nonlinear interactions between
+predictors and responses while promoting model sparsity. We provide theoretical
+results demonstrating selection consistency and illustrate the performance of
+our approach through numerical examples.
+
+
+
+
+
+
+
+ ☆ OMG-HD: A High-Resolution AI Weather Model for End-to-End Forecasts from
+ Observations
+
+
+ In recent years, Artificial Intelligence Weather Prediction (AIWP) models
+have achieved performance comparable to, or even surpassing, traditional
+Numerical Weather Prediction (NWP) models by leveraging reanalysis data.
+However, a less-explored approach involves training AIWP models directly on
+observational data, enhancing computational efficiency and improving forecast
+accuracy by reducing the uncertainties introduced through data assimilation
+processes. In this study, we propose OMG-HD, a novel AI-based regional
+high-resolution weather forecasting model designed to make predictions directly
+from observational data sources, including surface stations, radar, and
+satellite, thereby removing the need for operational data assimilation. Our
+evaluation shows that OMG-HD outperforms both the European Centre for
+Medium-Range Weather Forecasts (ECMWF)'s high-resolution operational
+forecasting system, IFS-HRES, and the High-Resolution Rapid Refresh (HRRR)
+model at lead times of up to 12 hours across the contiguous United States
+(CONUS) region. We achieve up to a 13% improvement on RMSE for 2-meter
+temperature, 17% on 10-meter wind speed, 48% on 2-meter specific humidity, and
+32% on surface pressure compared to HRRR. Our method shows that it is possible
+to use AI-driven approaches for rapid weather predictions without relying on
+NWP-derived weather fields as model input. This is a promising step towards
+using observational data directly to make operational forecasts with AIWP
+models.
+
+
+
+
+
+
+
+ ☆ Schrödinger Bridge Type Diffusion Models as an Extension of Variational
+ Autoencoders
+
+
+ Generative diffusion models use time-forward and backward stochastic
+differential equations to connect the data and prior distributions. While
+conventional diffusion models (e.g., score-based models) only learn the
+backward process, more flexible frameworks have been proposed to also learn the
+forward process by employing the Schr\"odinger bridge (SB). However, due to the
+complexity of the mathematical structure behind SB-type models, we cannot
+easily give an intuitive understanding of their objective function. In this
+work, we propose a unified framework to construct diffusion models by
+reinterpreting the SB-type models as an extension of variational autoencoders.
+In this context, the data processing inequality plays a crucial role. As a
+result, we find that the objective function consists of the prior loss and
+drift matching parts.
+
+
+
+
+
+
+
+ ☆ Conditional Deep Canonical Time Warping
+
+
+ Temporal alignment of sequences is a fundamental challenge in many
+applications, such as computer vision and bioinformatics, where local time
+shifting needs to be accounted for. Misalignment can lead to poor model
+generalization, especially in high-dimensional sequences. Existing methods
+often struggle with optimization when dealing with high-dimensional sparse
+data, falling into poor alignments. Feature selection is frequently used to
+enhance model performance for sparse data. However, a fixed set of selected
+features would not generally work for dynamically changing sequences and would
+need to be modified based on the state of the sequence. Therefore, modifying
+the selected feature based on contextual input would result in better
+alignment. Our suggested method, Conditional Deep Canonical Temporal Time
+Warping (CDCTW), is designed for temporal alignment in sparse temporal data to
+address these challenges. CDCTW enhances alignment accuracy for
+high-dimensional time-dependent views by performing dynamic time warping on data
+embedded in a maximally correlated subspace, which handles sparsity with a novel
+feature selection method. We validate the effectiveness of CDCTW through
+extensive experiments on various datasets, demonstrating superior performance
+over previous techniques.
+
+
+
+
+
+
+
+
+ Yan Zhang, Guoqiang Wu, Bingzheng Wang, Teng Pang, Haoliang Sun, Yilong Yin
+
+
+ In Continual Learning (CL), while existing work primarily focuses on the
+multi-class classification task, there has been limited research on Multi-Label
+Learning (MLL). In practice, MLL datasets are often class-imbalanced, making it
+inherently challenging, a problem that is even more acute in CL. Due to its
+sensitivity to imbalance, Macro-AUC is an appropriate and widely used measure
+in MLL. However, there is no research to optimize Macro-AUC in MLCL
+specifically. To fill this gap, in this paper, we propose a new memory
+replay-based method to tackle the imbalance issue for Macro-AUC-oriented MLCL.
+Specifically, inspired by recent theory work, we propose a new Reweighted
+Label-Distribution-Aware Margin (RLDAM) loss. Furthermore, to be compatible
+with the RLDAM loss, a new memory-updating strategy named Weight Retain
+Updating (WRU) is proposed to maintain the numbers of positive and negative
+instances of the original dataset in memory. Theoretically, we provide superior
+generalization analyses of the RLDAM-based algorithm in terms of Macro-AUC,
+separately in batch MLL and MLCL settings. This is the first work to offer
+theoretical generalization analyses in MLCL to our knowledge. Finally, a series
+of experimental results illustrate the effectiveness of our method over several
+baselines. Our codes are available at
+https://github.com/ML-Group-SDU/Macro-AUC-CL.
+
+
+
+ comment: 7 pages of main text, 11 pages of appendix, accepted to AAAI 2025
+
+ With the development of the financial industry, credit default prediction, as
+an important task in financial risk management, has received increasing
+attention. Traditional credit default prediction methods mostly rely on machine
+learning models, such as decision trees and random forests, but these methods
+have certain limitations in processing complex data and capturing potential
+risk patterns. To this end, this paper proposes a deep learning model based on
+the combination of convolutional neural networks (CNN) and Transformer for
+credit user default prediction. The model combines the advantages of CNN in
+local feature extraction with the ability of Transformer in global dependency
+modeling, effectively improving the accuracy and robustness of credit default
+prediction. Through experiments on public credit default datasets, the results
+show that the CNN+Transformer model outperforms traditional machine learning
+models, such as random forests and XGBoost, in multiple evaluation indicators
+such as accuracy, AUC, and KS value, demonstrating its powerful ability in
+complex financial data modeling. Further experimental analysis shows that
+appropriate optimizer selection and learning rate adjustment play a vital role
+in improving model performance. In addition, the ablation experiment of the
+model verifies the advantages of the combination of CNN and Transformer and
+proves the complementarity of the two in credit default prediction. This study
+provides a new idea for credit default prediction and provides strong support
+for risk assessment and intelligent decision-making in the financial field.
+Future research can further improve the prediction effect and generalization
+ability by introducing more unstructured data and improving the model
+architecture.
+
+
+
+
+
+
+
+ ☆ GIMS: Image Matching System Based on Adaptive Graph Construction and
+ Graph Neural Network
+
+
+
+
+
+
+
+
+ Xianfeng Song, Yi Zou, Zheng Shi, Zheng Liu
+
+
+ Feature-based image matching has extensive applications in computer vision.
+Keypoints detected in images can be naturally represented as graph structures,
+and Graph Neural Networks (GNNs) have been shown to outperform traditional deep
+learning techniques. Consequently, the paradigm of image matching via GNNs has
+gained significant prominence in recent academic research. In this paper, we
+first introduce an innovative adaptive graph construction method that utilizes
+a filtering mechanism based on distance and dynamic threshold similarity. This
+method dynamically adjusts the criteria for incorporating new vertices based on
+the characteristics of existing vertices, allowing for the construction of more
+precise and robust graph structures while avoiding redundancy. We further
+combine the vertex processing capabilities of GNNs with the global awareness
+capabilities of Transformers to enhance the model's representation of spatial
+and feature information within graph structures. This hybrid model provides a
+deeper understanding of the interrelationships between vertices and their
+contributions to the matching process. Additionally, we employ the Sinkhorn
+algorithm to iteratively solve for optimal matching results. Finally, we
+validate our system using extensive image datasets and conduct comprehensive
+comparative experiments. Experimental results demonstrate that our system
+achieves an average improvement of 3.8x-40.3x in overall matching performance.
+Additionally, the number of vertices and edges significantly impacts training
+efficiency and memory usage; therefore, we employ multi-GPU technology to
+accelerate the training process. Our code is available at
+https://github.com/songxf1024/GIMS.
+
+
+
+
+
+
+
+ ☆ On the Effectiveness of Adversarial Training on Malware Classifiers
+
+
+
+
+
+
+
+
+ Hamid Bostani, Jacopo Cortellazzi, Daniel Arp, Fabio Pierazzi, Veelasha Moonsamy, Lorenzo Cavallaro
+
+
+ Adversarial Training (AT) has been widely applied to harden learning-based
+classifiers against adversarial evasive attacks. However, its effectiveness in
+identifying and strengthening vulnerable areas of the model's decision space
+while maintaining high performance on clean data of malware classifiers remains
+an under-explored area. In this context, the robustness that AT achieves has
+often been assessed against unrealistic or weak adversarial attacks, which
+negatively affect performance on clean data and are arguably no longer threats.
+Previous work seems to suggest robustness is a task-dependent property of AT.
+We instead argue it is a more complex problem that requires exploring AT and
+the intertwined roles played by certain factors within data, feature
+representations, classifiers, and robust optimization settings, as well as
+proper evaluation factors, such as the realism of evasion attacks, to gain a
+true sense of AT's effectiveness. In our paper, we address this gap by
+systematically exploring the role such factors have in hardening malware
+classifiers through AT. Contrary to recent prior work, a key observation of our
+research, confirmed by extensive experiments, is that all such factors
+influence the actual effectiveness of AT, as demonstrated by the varying
+degrees of success from our empirical analysis. We identify five evaluation
+pitfalls that affect state-of-the-art studies and summarize our insights in ten
+takeaways to draw promising research directions toward better understanding the
+factors' settings under which adversarial training works best.
+
+
+
+
+
+
+
+ ☆ U-Mamba-Net: A highly efficient Mamba-based U-net style network for
+ noisy and reverberant speech separation
+
+
+ The topic of speech separation involves separating mixed speech with multiple
+overlapping speakers into several streams, with each stream containing speech
+from only one speaker. Many highly effective models have emerged and
+proliferated rapidly over time. However, the size and computational load of
+these models have also increased accordingly. This is a disaster for the
+community, as researchers need more time and computational resources to
+reproduce and compare existing models. In this paper, we propose U-Mamba-Net: a
+lightweight Mamba-based U-style model for speech separation in complex
+environments. Mamba is a state space sequence model that incorporates feature
+selection capabilities. The U-style network is a fully convolutional neural network
+whose symmetric contracting and expansive paths are able to learn
+multi-resolution features. In our work, Mamba serves as a feature filter,
+alternating with U-Net. We test the proposed model on Libri2mix. The results
+show that U-Mamba-Net achieves improved performance with quite low
+computational cost.
+
+
+ Artificial Intelligence Generated Content (AIGC) has gained significant
+popularity for creating diverse content. Current AIGC models primarily focus on
+content quality within a centralized framework, resulting in a high service
+delay and negative user experiences. However, not only does the workload of an
+AIGC task depend on the AIGC model's complexity rather than the amount of data,
+but the large model and its multi-layer encoder structure also result in a huge
+demand for computational and memory resources. These unique characteristics
+pose new challenges in its modeling, deployment, and scheduling at edge
+networks. Thus, we model an offloading problem among edges for providing real
+AIGC services and propose LAD-TS, a novel Latent Action Diffusion-based Task
+Scheduling method that orchestrates multiple edge servers for expedited AIGC
+services. The LAD-TS generates a near-optimal offloading decision by leveraging
+the diffusion model's conditional generation capability and the reinforcement
+learning's environment interaction ability, thereby minimizing the service
+delays under multiple resource constraints. Meanwhile, a latent action
+diffusion strategy is designed to guide decision generation by utilizing
+historical action probability, enabling rapid achievement of near-optimal
+decisions. Furthermore, we develop DEdgeAI, a prototype edge system with a
+refined AIGC model deployment to implement and evaluate our LAD-TS method.
+DEdgeAI provides a real AIGC service for users, demonstrating up to 29.18%
+shorter service delays than the current five representative AIGC platforms. We
+release our open-source code at https://github.com/ChangfuXu/DEdgeAI/.
+
+
+ This paper introduces a quantum framework for addressing reinforcement
+learning (RL) tasks, grounded in quantum principles and leveraging a fully
+quantum model of the classical Markov Decision Process (MDP). By employing
+quantum concepts and a quantum search algorithm, this work presents the
+implementation and optimization of the agent-environment interactions entirely
+within the quantum domain, eliminating reliance on classical computations. Key
+contributions include the quantum-based state transitions, return calculation,
+and trajectory search mechanism that utilize quantum principles to demonstrate
+the realization of RL processes through quantum phenomena. The implementation
+emphasizes the fundamental role of quantum superposition in enhancing
+computational efficiency for RL tasks. Experimental results demonstrate the
+capacity of a quantum model to achieve quantum advantage in RL, highlighting
+the potential of fully quantum implementations in decision-making tasks. This
+work not only underscores the applicability of quantum computing in machine
+learning but also contributes to the field of quantum reinforcement learning (QRL)
+by offering a robust framework for understanding and exploiting quantum
+computing in RL systems.
+
+
+
+
+
+
+
+ ☆ Sharper Error Bounds in Late Fusion Multi-view Clustering Using
+ Eigenvalue Proportion
+
+
+ Multi-view clustering (MVC) aims to integrate complementary information from
+multiple views to enhance clustering performance. Late Fusion Multi-View
+Clustering (LFMVC) has shown promise by synthesizing diverse clustering results
+into a unified consensus. However, current LFMVC methods struggle with noisy
+and redundant partitions and often fail to capture high-order correlations
+across views. To address these limitations, we present a novel theoretical
+framework for analyzing the generalization error bounds of multiple kernel
+$k$-means, leveraging local Rademacher complexity and principal eigenvalue
+proportions. Our analysis establishes a convergence rate of $\mathcal{O}(1/n)$,
+significantly improving upon the existing rate in the order of
+$\mathcal{O}(\sqrt{k/n})$. Building on this insight, we propose a low-pass
+graph filtering strategy within a multiple linear $k$-means framework to
+mitigate noise and redundancy, further refining the principal eigenvalue
+proportion and enhancing clustering accuracy. Experimental results on benchmark
+datasets confirm that our approach outperforms state-of-the-art methods in
+clustering performance and robustness. The related code is available at
+https://github.com/csliangdu/GMLKM .
+
+
+
+
+
+
+
+ ☆ Developing Cryptocurrency Trading Strategy Based on Autoencoder-CNN-GANs
+ Algorithms
+
+
+ This paper leverages machine learning algorithms to forecast and analyze
+financial time series. The process begins with a denoising autoencoder to
+filter out random noise fluctuations from the main contract price data. Then,
+one-dimensional convolution reduces the dimensionality of the filtered data and
+extracts key information. The filtered and dimensionality-reduced price data is
+fed into a GANs network, and its output serves as the input to a fully connected
+network. Through cross-validation, a model is trained to capture features that
+precede large price fluctuations. The model predicts the likelihood and
+direction of significant price changes in real-time price sequences, placing
+trades at moments of high prediction accuracy. Empirical results demonstrate
+that using autoencoders and convolution to filter and denoise financial data,
+combined with GANs, achieves a certain level of predictive performance,
+validating the capabilities of machine learning algorithms to discover
+underlying patterns in financial sequences. Keywords - CNN;GANs;
+Cryptocurrency; Prediction.
+
+
+
+ comment: The paper was accepted by the 2024 4th International Conference on
+ Artificial Intelligence, Robotics, and Communication (ICAIRC 2024)
+
+
+
+
+
+
+ ☆ Leveraging Deep Learning with Multi-Head Attention for Accurate
+ Extraction of Medicine from Handwritten Prescriptions
+
+
+
+
+
+
+
+
+ Usman Ali, Sahil Ranmbail, Muhammad Nadeem, Hamid Ishfaq, Muhammad Umer Ramzan, Waqas Ali
+
+
+ Extracting medication names from handwritten doctor prescriptions is
+challenging due to the wide variability in handwriting styles and prescription
+formats. This paper presents a robust method for extracting medicine names
+using a combination of Mask R-CNN and Transformer-based Optical Character
+Recognition (TrOCR) with Multi-Head Attention and Positional Embeddings. A
+novel dataset, featuring diverse handwritten prescriptions from various regions
+of Pakistan, was utilized to fine-tune the model on different handwriting
+styles. The Mask R-CNN model segments the prescription images to focus on the
+medicinal sections, while the TrOCR model, enhanced by Multi-Head Attention and
+Positional Embeddings, transcribes the isolated text. The transcribed text is
+then matched against a pre-existing database for accurate identification. The
+proposed approach achieved a character error rate (CER) of 1.4% on standard
+benchmarks, highlighting its potential as a reliable and efficient tool for
+automating medicine name extraction.
+
+
+
+
+
+
+
+
+ Zeru Shi, Zhenting Wang, Yongye Su, Weidi Luo, Fan Yang, Yongfeng Zhang
+
+
+ The performance of Large Language Models (LLMs) depends on the quality of
+the prompts and on the semantic and structural integrity of the input
+data. However, current prompt generation methods primarily focus on generating
+prompts for clean input data, often overlooking the impact of perturbed inputs
+on prompt performance. To address this limitation, we propose BATprompt (By
+Adversarial Training prompt), a novel method for prompt generation designed to
+withstand input perturbations (such as typos in the input). Inspired by
+adversarial training techniques, BATprompt demonstrates strong performance on a
+variety of perturbed tasks through a two-step process: adversarial perturbation
+and iterative optimization on unperturbed input via LLM. Unlike conventional
+adversarial attack methods, BATprompt avoids reliance on real gradients or
+model parameters. Instead, it leverages the advanced reasoning, language
+understanding, and self-reflection capabilities of LLMs to simulate gradients,
+guiding the generation of adversarial perturbations and optimizing prompt
+performance. In our experiments, we evaluate BATprompt on multiple datasets
+across both language understanding and generation tasks. The results indicate
+that BATprompt outperforms existing prompt generation methods, delivering
+superior robustness and performance under diverse perturbation scenarios.
+
+
+
+
+
+
+
+ ☆ Learning Sign Language Representation using CNN LSTM, 3DCNN, CNN RNN
+ LSTM and CCN TD
+
+
+ Existing Sign Language Learning applications focus on the demonstration of
+the sign in the hope that the student will copy a sign correctly. In these
+cases, only a teacher can confirm that the sign was completed correctly, by
+reviewing a video captured manually. Sign Language Translation is a widely
+explored field in visual recognition. This paper seeks to explore the
+algorithms that will allow for real-time, video sign translation, and grading
+of sign language accuracy for new sign language users. This requires algorithms
+capable of recognizing and processing spatial and temporal features. The aim of
+this paper is to evaluate and identify the best neural network algorithm that
+can facilitate a sign language tuition system of this nature. Modern popular
+algorithms including CNN and 3DCNN are compared on a dataset not yet explored,
+Trinidad and Tobago Sign Language (TTSL), as well as an American Sign Language
+(ASL) dataset. The 3DCNN algorithm was found to be the best-performing neural
+network algorithm among these systems, with 91% accuracy on the TTSL dataset
+and 83% accuracy on the ASL dataset.
+
+
+
+ comment: 10 pages
+
+
+
+
+
+
+ ☆ Unified Stochastic Framework for Neural Network Quantization and Pruning
+
+
+ Quantization and pruning are two essential techniques for compressing neural
+networks, yet they are often treated independently, with limited theoretical
+analysis connecting them. This paper introduces a unified framework for
+post-training quantization and pruning using stochastic path-following
+algorithms. Our approach builds on the Stochastic Path Following Quantization
+(SPFQ) method, extending its applicability to pruning and low-bit quantization,
+including challenging 1-bit regimes. By incorporating a scaling parameter and
+generalizing the stochastic operator, the proposed method achieves robust error
+correction and yields rigorous theoretical error bounds for both quantization
+and pruning as well as their combination.
+
+
+ For a data-generating process for random variables that can be described with
+a linear structural equation model, we consider a situation in which (i) a set
+of covariates satisfying the back-door criterion cannot be observed or (ii)
+such a set can be observed, but standard statistical estimation methods cannot
+be applied to estimate causal effects because of
+multicollinearity/high-dimensional data problems. We propose a novel two-stage
+penalized regression approach, the penalized covariate-mediator selection
+operator (PCM Selector), to estimate the causal effects in such scenarios.
+Unlike existing penalized regression analyses, when a set of intermediate
+variables is available, PCM Selector provides a consistent or less biased
+estimator of the causal effect. In addition, PCM Selector provides a variable
+selection procedure for intermediate variables to obtain better estimation
+accuracy of the causal effects than does the back-door criterion.
+
+
+
+
+
+
+
+ ☆ Enhancing Online Continual Learning with Plug-and-Play State Space Model
+ and Class-Conditional Mixture of Discretization
+
+
+
+
+
+
+
+
+ Sihao Liu, Yibo Yang, Xiaojie Li, David A. Clifton, Bernard Ghanem
+
+
+ Online continual learning (OCL) seeks to learn new tasks from data streams
+that appear only once, while retaining knowledge of previously learned tasks.
+Most existing methods rely on replay, focusing on enhancing memory retention
+through regularization or distillation. However, they often overlook the
+adaptability of the model, limiting the ability to learn generalizable and
+discriminative features incrementally from online training data. To address
+this, we introduce a plug-and-play module, S6MOD, which can be integrated into
+most existing methods and directly improve adaptability. Specifically, S6MOD
+introduces an extra branch after the backbone, where a mixture of
+discretization selectively adjusts parameters in a selective state space model,
+enriching selective scan patterns such that the model can adaptively select the
+most sensitive discretization method for current dynamics. We further design a
+class-conditional routing algorithm for dynamic, uncertainty-based adjustment
+and implement a contrastive discretization loss to optimize it. Extensive
+experiments combining our module with various models demonstrate that S6MOD
+significantly enhances model adaptability, leading to substantial performance
+gains and achieving state-of-the-art results.
+
+
+
+
+
+
+
+ ☆ Stochastic Control for Fine-tuning Diffusion Models: Optimality,
+ Regularity, and Convergence
+
+
+ Diffusion models have emerged as powerful tools for generative modeling,
+demonstrating exceptional capability in capturing target data distributions
+from large datasets. However, fine-tuning these massive models for specific
+downstream tasks, constraints, and human preferences remains a critical
+challenge. While recent advances have leveraged reinforcement learning
+algorithms to tackle this problem, much of the progress has been empirical,
+with limited theoretical understanding. To bridge this gap, we propose a
+stochastic control framework for fine-tuning diffusion models. Building on
+denoising diffusion probabilistic models as the pre-trained reference dynamics,
+our approach integrates linear dynamics control with Kullback-Leibler
+regularization. We establish the well-posedness and regularity of the
+stochastic control problem and develop a policy iteration algorithm (PI-FT) for
+numerical solution. We show that PI-FT achieves global convergence at a linear
+rate. Unlike existing work that assumes regularities throughout training, we
+prove that the control and value sequences generated by the algorithm maintain
+the regularity. Additionally, we explore extensions of our framework to
+parametric settings and continuous-time formulations.
+
+
+
+ comment: 28 pages
+
+
+
+
+
+
+ ☆ Neural Conformal Control for Time Series Forecasting
+
+
+ We introduce a neural network conformal prediction method for time series
+that enhances adaptivity in non-stationary environments. Our approach acts as a
+neural controller designed to achieve desired target coverage, leveraging
+auxiliary multi-view data with neural network encoders in an end-to-end manner
+to further enhance adaptivity. Additionally, our model is designed to enhance
+the consistency of prediction intervals in different quantiles by integrating
+monotonicity constraints and leverages data from related tasks to boost
+few-shot learning performance. Using real-world datasets from epidemics,
+electric demand, weather, and others, we empirically demonstrate significant
+improvements in coverage and probabilistic accuracy, and find that our method
+is the only one that combines good calibration with consistency in prediction
+intervals.
+
+
+
+
+
+
+
+ ☆ An Instrumental Value for Data Production and its Application to Data
+ Pricing
+
+
+ How much value does a dataset or a data production process have to an agent
+who wishes to use the data to assist decision-making? This is a fundamental
+question towards understanding the value of data as well as further pricing of
+data. This paper develops an approach for capturing the instrumental value of
+data production processes, which takes two key factors into account: (a) the
+context of the agent's decision-making problem; (b) prior data or information
+the agent already possesses. We "micro-found" our valuation concepts by
+showing how they connect to classic notions of information design and signals
+in information economics. When instantiated in the domain of Bayesian linear
+regression, our value naturally corresponds to information gain. Based on our
+designed data value, we then study a basic monopoly pricing setting with a
+buyer looking to purchase from a seller some labeled data of a certain feature
+direction in order to improve a Bayesian regression model. We show that when
+the seller has the ability to fully customize any data request, she can extract
+the first-best revenue (i.e., full surplus) from any population of buyers,
+i.e., achieving first-degree price discrimination. If the seller can only sell
+data that are derived from an existing data pool, this limits her ability to
+customize, and achieving first-best revenue becomes generally impossible.
+However, we design a mechanism that achieves seller revenue at most $\log
+(\kappa)$ less than the first-best revenue, where $\kappa$ is the condition
+number associated with the data matrix. A corollary of this result is that the
+seller can extract the first-best revenue in the multi-armed bandits special
+case.
+
+
+
+
+
+
+
+ ☆ Fundamental Limits in the Search for Less Discriminatory Algorithms --
+ and How to Avoid Them NeurIPS
+
+
+ Disparate impact doctrine offers an important legal apparatus for targeting
+unfair data-driven algorithmic decisions. A recent body of work has focused on
+conceptualizing and operationalizing one particular construct from this
+doctrine -- the less discriminatory alternative, an alternative policy that
+reduces disparities while meeting the same business needs of a status quo or
+baseline policy. This paper puts forward four fundamental results, which each
+represent limits to searching for and using less discriminatory algorithms
+(LDAs). (1) Statistically, although LDAs are almost always identifiable in
+retrospect on fixed populations, making conclusions about how alternative
+classifiers perform on an unobserved distribution is more difficult. (2)
+Mathematically, a classifier can only exhibit certain combinations of accuracy
+and selection rate disparity between groups, given the size of each group and
+the base rate of the property or outcome of interest in each group. (3)
+Computationally, a search for a lower-disparity classifier at some baseline
+level of utility is NP-hard. (4) From a modeling and consumer welfare
+perspective, defining an LDA only in terms of business needs can lead to LDAs
+that leave consumers strictly worse off, including members of the disadvantaged
+group. These findings, which may seem on their face to give firms strong
+defenses against discrimination claims, only tell part of the story. For each
+of our negative results limiting what is attainable in this setting, we offer
+positive results demonstrating that there exist low-cost
+strategies that are remarkably effective at identifying viable lower-disparity
+policies.
+
+
+
+ comment: 23 pages, 4 figures, 1 table. Prior versions appeared at NeurIPS
+ Algorithmic Fairness Through the Lens of Metrics and Evaluation Workshop
+ (AFME 2024) and Regulatable ML Workshop (RegML 2024). Forthcoming at ACM
+ CS&Law 2025
+
+
+
+
+
+
+ ☆ Learning Randomized Reductions and Program Properties
+
+
+ The correctness of computations remains a significant challenge in computer
+science, with traditional approaches relying on automated testing or formal
+verification. Self-testing/correcting programs introduce an alternative
+paradigm, allowing a program to verify and correct its own outputs via
+randomized reductions, a concept that previously required manual derivation. In
+this paper, we present Bitween, a method and tool for automated learning of
+randomized (self)-reductions and program properties in numerical programs.
+Bitween combines symbolic analysis and machine learning, with a surprising
+finding: polynomial-time linear regression, a basic optimization method, is not
+only sufficient but also highly effective for deriving complex randomized
+self-reductions and program invariants, often outperforming sophisticated
+mixed-integer linear programming solvers. We establish a theoretical framework
+for learning these reductions and introduce RSR-Bench, a benchmark suite for
+evaluating Bitween's capabilities on scientific and machine learning functions.
+Our empirical results show that Bitween surpasses state-of-the-art tools in
+scalability, stability, and sample efficiency when evaluated on nonlinear
+invariant benchmarks like NLA-DigBench. Bitween is open-source as a Python
+package and accessible via a web interface that supports C language programs.
+
+
+
+
+
+
+
+ ☆ Age Optimal Sampling for Unreliable Channels under Unknown Channel
+ Statistics
+
+
+ In this paper, we study a system in which a sensor forwards status updates to
+a receiver through an error-prone channel, while the receiver sends the
+transmission results back to the sensor via a reliable channel. Both channels
+are subject to random delays. To evaluate the timeliness of the status
+information at the receiver, we use the Age of Information (AoI) metric. The
+objective is to design a sampling policy that minimizes the expected
+time-average AoI, even when the channel statistics (e.g., delay distributions)
+are unknown. We first review the threshold structure of the optimal offline
+policy under known channel statistics and then reformulate the design of the
+online algorithm as a stochastic approximation problem. We propose a
+Robbins-Monro algorithm to solve this problem and demonstrate that the optimal
+threshold can be approximated almost surely. Moreover, we prove that the
+cumulative AoI regret of the online algorithm grows at rate
+$\mathcal{O}(\ln K)$, where $K$ is the number of successful transmissions. In
+addition, our algorithm is shown to be minimax order optimal, in the sense that
+for any online learning algorithm, the cumulative AoI regret up to the $K$-th
+successful transmission grows at a rate of at least $\Omega(\ln K)$ under the
+worst-case delay distribution. Finally, we improve the stability of the
+proposed online learning algorithm through a momentum-based stochastic gradient
+descent algorithm. Simulation results validate the performance of our proposed
+algorithm.
+
+
+
+
+
+
+
+ ♻ ☆ Principal Component Flow Map Learning of PDEs from Incomplete, Limited,
+ and Noisy Data
+
+
+ We present a computational technique for modeling the evolution of dynamical
+systems in a reduced basis, with a focus on the challenging problem of modeling
+partially-observed partial differential equations (PDEs) on high-dimensional
+non-uniform grids. We address limitations of previous work on data-driven flow
+map learning in the sense that we focus on noisy and limited data to move
+toward data collection scenarios in real-world applications. Leveraging recent
+work on modeling PDEs in modal and nodal spaces, we present a neural network
+structure that is suitable for PDE modeling with noisy and limited data
+available only on a subset of the state variables or computational domain. In
+particular, spatial grid-point measurements are reduced using a learned linear
+transformation, after which the dynamics are learned in this reduced basis
+before being transformed back out to the nodal space. This approach yields a
+drastically reduced parameterization of the neural network compared with
+previous flow map models for nodal space learning. This allows for rapid
+high-resolution simulations, enabled by smaller training data sets and reduced
+training times.
+
+
+
+
+
+
+
+
+ Muhammad Rajabinasab, Anton D. Lautrup, Tobias Hyrup, Arthur Zimek
+
+
+ Expressive evaluation metrics are indispensable for informative experiments
+in all areas, and while several metrics are established in some areas, in
+others, such as feature selection, only indirect or otherwise limited
+evaluation metrics are found. In this paper, we propose a novel evaluation
+metric to address several problems of its predecessors and allow for flexible
+and reliable evaluation of feature selection algorithms. The proposed metric is
+a dynamic metric with two properties that can be used to evaluate both the
+performance and the stability of a feature selection algorithm. We conduct
+several empirical experiments to illustrate the use of the proposed metric in
+the successful evaluation of feature selection algorithms. We also provide a
+comparison and analysis to show the different aspects involved in the
+evaluation of the feature selection algorithms. The results indicate that the
+proposed metric is successful in carrying out the evaluation task for feature
+selection algorithms.
+ This paper is an extended version of a paper published at SISAP 2024.
+
+
+
+ comment: Short version of this paper is published at 17th International
+ Conference on Similarity Search and Applications, SISAP 2024
+
+
+
+
+
+
+ ♻ ☆ SpikingSSMs: Learning Long Sequences with Sparse and Parallel Spiking
+ State Space Models
+
+
+ Known as low energy consumption networks, spiking neural networks (SNNs) have
+gained a lot of attention within the past decades. While SNNs are increasingly
+competitive with artificial neural networks (ANNs) for vision tasks, they are
+rarely used for long sequence tasks, despite their intrinsic temporal dynamics.
+In this work, we develop spiking state space models (SpikingSSMs) for long
+sequence learning by leveraging the sequence learning abilities of state
+space models (SSMs). Inspired by dendritic neuron structure, we hierarchically
+integrate neuronal dynamics with the original SSM block while realizing
+sparse synaptic computation. Furthermore, to resolve the conflict between
+event-driven neuronal dynamics and parallel computing, we propose a lightweight
+surrogate dynamic network that accurately predicts the after-reset membrane
+potential and is compatible with learnable thresholds, enabling orders of
+magnitude of acceleration in training speed compared with conventional iterative
+methods. On the Long Range Arena benchmark, SpikingSSM achieves performance
+competitive with state-of-the-art SSMs while realizing on average 90\% network
+sparsity. On language modeling, our network significantly surpasses existing
+spiking large language models (spikingLLMs) on the WikiText-103 dataset with
+only a third of the model size, demonstrating its potential as a backbone
+architecture for low computation cost LLMs.
+
+
+
+
+
+
+
+
+ Yujie Zhao, Jose Efraim Aguilar Escamill, Weyl Lu, Huazheng Wang
+
+
+ Reinforcement Learning from Human Feedback (RLHF) studies the problem where agents
+receive only preferences over pairs of trajectories in each episode.
+Traditional approaches in this field have predominantly focused on the mean
+reward or utility criterion. However, in RLHF scenarios demanding heightened
+risk awareness, such as in AI systems, healthcare, and agriculture, risk-aware
+measures are requisite. Traditional risk-aware objectives and algorithms are
+not applicable in such one-episode-reward settings. To address this, we explore
+and prove the applicability of two risk-aware objectives to RLHF: nested and
+static quantile risk objectives. We also introduce Risk-Aware-RLHF (RA-RLHF),
+an algorithm designed to optimize both nested and static objectives.
+Additionally, we provide a theoretical analysis of the regret upper bounds,
+demonstrating that they are sublinear with respect to the number of episodes,
+and present empirical results to support our findings. Our code is available at
+https://github.com/aguilarjose11/pbrlNeurips.
+
+
+ Deep neural networks typically rely on a single forward pass for inference,
+which can limit their capacity to resolve ambiguous inputs. We introduce
+Contextual Backpropagation Loops (CBLs) as an iterative mechanism that
+incorporates top-down feedback to refine intermediate representations, thereby
+improving accuracy and robustness. This repeated process mirrors how humans
+continuously re-interpret sensory information in daily life, checking and
+re-checking their perceptions using contextual cues. Our results suggest that
+CBLs can offer a straightforward yet powerful way to incorporate such
+contextual reasoning in modern deep learning architectures.
+
+
+
+
+
+
+
+ ♻ ☆ Deep Adaptive Interest Network: Personalized Recommendation with
+ Context-Aware Learning
+
+
+
+
+
+
+
+
+ Shuaishuai Huang, Haowei Yang, You Yao, Xueting Lin, Yuming Tu
+
+
+ In personalized recommendation systems, accurately capturing users' evolving
+interests and combining them with contextual information is a critical research
+area. This paper proposes a novel model called the Deep Adaptive Interest
+Network (DAIN), which dynamically models users' interests while incorporating
+context-aware learning mechanisms to achieve precise and adaptive personalized
+recommendations. DAIN leverages deep learning techniques to build an adaptive
+interest network structure that can capture users' interest changes in
+real-time while further optimizing recommendation results by integrating
+contextual information. Experiments conducted on several public datasets
+demonstrate that DAIN excels in both recommendation performance and
+computational efficiency. This research not only provides a new solution for
+personalized recommendation systems but also offers fresh insights into the
+application of context-aware learning in recommendation systems.
+
+
+
+
+
+
+
+ ♻ ☆ MrSteve: Instruction-Following Agents in Minecraft with What-Where-When
+ Memory
+
+
+ Significant advances have been made in developing general-purpose embodied AI
+in environments like Minecraft through the adoption of LLM-augmented
+hierarchical approaches. While these approaches, which combine high-level
+planners with low-level controllers, show promise, low-level controllers
+frequently become performance bottlenecks due to repeated failures. In this
+paper, we argue that the primary cause of failure in many low-level controllers
+is the absence of an episodic memory system. To address this, we introduce
+MrSteve (Memory Recall Steve-1), a novel low-level controller equipped with
+Place Event Memory (PEM), a form of episodic memory that captures what, where,
+and when information from episodes. This directly addresses the main limitation
+of the popular low-level controller, Steve-1. Unlike previous models that rely
+on short-term memory, PEM organizes spatial and event-based data, enabling
+efficient recall and navigation in long-horizon tasks. Additionally, we propose
+an Exploration Strategy and a Memory-Augmented Task Solving Framework, allowing
+agents to alternate between exploration and task-solving based on recalled
+events. Our approach significantly improves task-solving and exploration
+efficiency compared to existing methods. We will release our code and demos on
+the project page: https://sites.google.com/view/mr-steve.
+
+
+
+
+
+
+
+
+ Yuchen He, Chuyun Shen, Xiangfeng Wang, Bo Jin
+
+
+ Federated continual learning (FCL) aims to learn from sequential data streams
+in the decentralized federated learning setting, while simultaneously
+mitigating the catastrophic forgetting issue in classical continual learning.
+Existing FCL methods usually employ typical rehearsal mechanisms, which could
+result in privacy violations or additional onerous storage and computational
+burdens. In this work, an efficient and non-IID robust federated continual
+learning framework, called Federated Prototype-Augmented Prompt Learning
+(FPPL), is proposed. The FPPL can collaboratively learn lightweight prompts
+augmented by prototypes without rehearsal. On the client side, a fusion
+function is employed to fully leverage the knowledge contained in task-specific
+prompts for alleviating catastrophic forgetting. Additionally, global
+prototypes aggregated from the server are used to obtain unified representation
+through contrastive learning, mitigating the impact of non-IID-derived data
+heterogeneity. On the server side, locally uploaded prototypes are utilized to
+perform debiasing on the classifier, further alleviating the performance
+degradation caused by both non-IID data and catastrophic forgetting. Empirical
+evaluations demonstrate the effectiveness of FPPL, achieving notable
+performance with an efficient design while remaining robust to diverse non-IID
+degrees. Code is available at: https://github.com/ycheoo/FPPL.
+
+
+
+
+
+
+
+ ♻ ☆ Data-driven decision-making under uncertainty with entropic risk measure
+
+
+ The entropic risk measure is widely used in high-stakes decision making to
+account for tail risks associated with an uncertain loss. With limited data,
+the empirical entropic risk estimator, i.e. replacing the expectation in the
+entropic risk measure with a sample average, underestimates the true risk. To
+debias the empirical entropic risk estimator, we propose a strongly
+asymptotically consistent bootstrapping procedure. The first step of the
+procedure involves fitting a distribution to the data, whereas the second step
+estimates the bias of the empirical entropic risk estimator using
+bootstrapping, and corrects for it. We show that naively fitting a Gaussian
+Mixture Model to the data using the maximum likelihood criterion typically
+leads to an underestimation of the risk. To mitigate this issue, we consider
+two alternative methods: a more computationally demanding one that fits the
+distribution of empirical entropic risk, and a simpler one that fits the
+extreme value distribution. As an application of the approach, we study a
+distributionally robust entropic risk minimization problem with type-$\infty$
+Wasserstein ambiguity set, where debiasing the validation performance using our
+techniques significantly improves the calibration of the size of the ambiguity
+set. Furthermore, we propose a distributionally robust optimization model for a
+well-studied insurance contract design problem. The model considers multiple
+(potential) policyholders that have dependent risks, where the insurer and
+policyholders use the entropic risk measure. We show that cross-validation
+methods can result in significantly higher out-of-sample risk for the insurer
+if the bias in validation performance is not corrected for. This improvement
+can be explained by the observation that our methods suggest a higher (and more
+accurate) premium to homeowners.
+
+
+ In the realm of graph learning, there is a category of methods that
+conceptualize graphs as hierarchical structures, utilizing node clustering to
+capture broader structural information. While generally effective, these
+methods often rely on a fixed graph coarsening routine, leading to overly
+homogeneous cluster representations and loss of node-level information. In this
+paper, we envision the graph as a network of interconnected node sets without
+compressing each cluster into a single embedding. To enable effective
+information transfer among these node sets, we propose the Node-to-Cluster
+Attention (N2C-Attn) mechanism. N2C-Attn incorporates techniques from Multiple
+Kernel Learning into the kernelized attention framework, effectively capturing
+information at both node and cluster levels. We then devise an efficient form
+for N2C-Attn using the cluster-wise message-passing framework, achieving linear
+time complexity. We further analyze how N2C-Attn combines bi-level feature maps
+of queries and keys, demonstrating its capability to merge dual-granularity
+information. The resulting architecture, Cluster-wise Graph Transformer
+(Cluster-GT), which uses node clusters as tokens and employs our proposed
+N2C-Attn module, shows superior performance on various graph-level tasks. Code
+is available at https://github.com/LUMIA-Group/Cluster-wise-Graph-Transformer.
+
+
+
+
+
+
+
+
+ Branislava Lalic, Dinh Viet Cuong, Mina Petric, Vladimir Pavlovic, Ana Firanj Sremac, Mark Roantree
+
+
+ Physics-based dynamic models (PBDMs) are simplified representations of
+complex dynamical systems. PBDMs take specific processes within a complex
+system and assign a fragment of variables and an accompanying set of parameters
+to depict the processes. As this often leads to suboptimal parameterisation of
+the system, a key challenge is refining the empirical parameters and
+variables to reduce uncertainties while maintaining the model's explainability
+and enhancing its predictive accuracy. We demonstrate that a hybrid mosquito
+population dynamics model, which integrates a PBDM with Physics-Informed Neural
+Networks (PINN), retains the explainability of the PBDM by incorporating the
+PINN-learned model parameters in place of its empirical counterparts.
+Specifically, we address the limitations of traditional PBDMs by modelling the
+parameters of larva and pupa development rates using a PINN that encodes
+complex, learned interactions of air temperature, precipitation and humidity.
+Our results demonstrate improved mosquito population simulations including the
+difficult-to-predict mosquito population peaks. This opens the possibility of
+applying the hybridisation concept to other complex systems based on PBDMs,
+such as cancer growth, to address the challenges posed by scarce and noisy data,
+and numerical weather prediction and climate modelling, to bridge the gap
+between physics-based and data-driven weather prediction models.
+
+
+
+
+
+
+
+
+ Daniel Nahmias, Gal Engelberg, Dan Klein, Asaf Shabtai
+
+
+ Spear-phishing attacks present a significant security challenge, with large
+language models (LLMs) escalating the threat by generating convincing emails
+and facilitating target reconnaissance. To address this, we propose a detection
+approach based on a novel document vectorization method that utilizes an
+ensemble of LLMs to create representation vectors. By prompting LLMs to reason
+and respond to human-crafted questions, we quantify the presence of common
+persuasion principles in the email's content, producing prompted contextual
+document vectors for a downstream supervised machine learning model. We
+evaluate our method using a unique dataset generated by a proprietary system
+that automates target reconnaissance and spear-phishing email creation. Our
+method achieves a 91\% F1 score in identifying LLM-generated spear-phishing
+emails, with the training set comprising only traditional phishing and benign
+emails. Key contributions include a novel document vectorization method
+utilizing LLM reasoning, a publicly available dataset of high-quality
+spear-phishing emails, and the demonstrated effectiveness of our method in
+detecting such emails. This methodology can be utilized for various document
+classification tasks, particularly in adversarial problem domains.
+
+
+
+
+
+
+
+ ♻ ☆ TableRAG: Million-Token Table Understanding with Language Models NeurIPS 2024
+
+
+ Recent advancements in language models (LMs) have notably enhanced their
+ability to reason with tabular data, primarily through program-aided mechanisms
+that manipulate and analyze tables. However, these methods often require the
+entire table as input, leading to scalability challenges due to the positional
+bias or context length constraints. In response to these challenges, we
+introduce TableRAG, a Retrieval-Augmented Generation (RAG) framework
+specifically designed for LM-based table understanding. TableRAG leverages
+query expansion combined with schema and cell retrieval to pinpoint crucial
+information before providing it to the LMs. This enables more efficient data
+encoding and precise retrieval, significantly reducing prompt lengths and
+mitigating information loss. We have developed two new million-token benchmarks
+from the Arcade and BIRD-SQL datasets to thoroughly evaluate TableRAG's
+effectiveness at scale. Our results demonstrate that TableRAG's retrieval
+design achieves the highest retrieval quality, leading to the new
+state-of-the-art performance on large-scale table understanding.
+
+
+ Transformer-based large language models (LLMs) use the key-value (KV) cache
+to significantly accelerate inference by storing the key and value embeddings
+of past tokens. However, this cache consumes significant GPU memory. In this
+work, we introduce HashEvict, an algorithm that uses locality-sensitive hashing
+(LSH) to compress the KV cache. HashEvict quickly locates tokens in the cache
+that are cosine dissimilar to the current query token. This is achieved by
+computing the Hamming distance between binarized Gaussian projections of the
+current token query and cached token keys, with a projection length much
+smaller than the embedding dimension. We maintain a lightweight binary
+structure in GPU memory to facilitate these calculations. Unlike existing
+compression strategies that compute attention to determine token retention,
+HashEvict makes these decisions pre-attention, thereby reducing computational
+costs. Additionally, HashEvict is dynamic - at every decoding step, the key and
+value of the current token replace the embeddings of a token expected to
+produce the lowest attention score. We demonstrate that HashEvict can compress
+the KV cache by 30%-70% while maintaining high performance across reasoning,
+multiple-choice, long-context retrieval and summarization tasks.
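+
+ The pre-attention eviction test described above can be sketched as follows.
+This is a minimal pure-Python illustration under stated assumptions, not the
+authors' implementation: the function names (make_projection, binarize,
+hamming, eviction_index) and the choice to pack sign bits into an integer are
+hypothetical and were introduced only for this example.

```python
import random


def make_projection(dim, n_bits, seed=0):
    """Draw an n_bits x dim matrix of standard Gaussian entries."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def binarize(vec, projection):
    """Project vec onto each Gaussian row and pack the sign bits into an int."""
    code = 0
    for row in projection:
        dot = sum(w * x for w, x in zip(row, vec))
        code = (code << 1) | (1 if dot >= 0.0 else 0)
    return code


def hamming(a, b):
    """Number of differing bits between two packed codes."""
    return bin(a ^ b).count("1")


def eviction_index(query, key_codes, projection):
    """Index of the cached key whose code is farthest from the query's code.

    A large Hamming distance between sign codes suggests low cosine
    similarity, so that token is the eviction candidate. No attention
    scores are computed here.
    """
    q_code = binarize(query, projection)
    distances = [hamming(q_code, k) for k in key_codes]
    return max(range(len(key_codes)), key=distances.__getitem__)


# Toy cache: the second key points opposite to the query.
projection = make_projection(dim=8, n_bits=4, seed=0)
keys = [
    [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    [-1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.9, 0.1, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
]
key_codes = [binarize(k, projection) for k in keys]
query = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
victim = eviction_index(query, key_codes, projection)
```

+ Only the short binary codes need to be kept alongside the cache, which is
+what makes the check cheap relative to computing attention.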
+
+
+
+ comment: 10 pages, 6 figures, 2 tables
+
+
+
+
+
+
+ ♻ ☆ Enhancing the Performance of Neural Networks Through Causal Discovery
+ and Integration of Domain Knowledge
+
+
+ In this paper, we develop a generic methodology to encode hierarchical
+causality structure among observed variables into a neural network in order to
+improve its predictive performance. The proposed methodology, called
+causality-informed neural network (CINN), leverages three coherent steps to
+systematically map the structural causal knowledge into the layer-to-layer
+design of a neural network while strictly preserving the orientation of every
+causal relationship. In the first step, CINN discovers causal relationships
+from observational data via directed acyclic graph (DAG) learning, where causal
+discovery is recast as a continuous optimization problem to avoid the
+combinatorial nature. In the second step, the discovered hierarchical causality
+structure among observed variables is systematically encoded into a neural
+network through a dedicated architecture and a customized loss function. By
+categorizing variables in the causal DAG as root, intermediate, and leaf nodes,
+the hierarchical causal DAG is translated into CINN with a one-to-one
+correspondence between nodes in the causal DAG and units in the CINN while
+maintaining the relative order among these nodes. Regarding the loss function,
+both intermediate and leaf nodes in the DAG are treated as target outputs
+during CINN training so as to drive co-learning of causal relationships among
+different types of nodes. As multiple loss components emerge in CINN, we
+leverage the projection of conflicting gradients to mitigate gradient
+interference among the multiple learning tasks. Computational experiments
+across a broad spectrum of UCI data sets demonstrate substantial advantages of
+CINN in predictive performance over other state-of-the-art methods. In
+addition, an ablation study underscores the value of integrating structural and
+quantitative causal knowledge in enhancing the neural network's predictive
+performance incrementally.
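+
+ The gradient-projection step mentioned above can be illustrated with a short
+sketch. It shows only the pairwise rule commonly associated with projecting
+conflicting gradients (when two task gradients have a negative inner product,
+the conflicting component is removed); how CINN orders or combines its loss
+components is not stated in the abstract, and the function names here are
+hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length gradient vectors."""
    return sum(a * b for a, b in zip(u, v))


def project_conflicting(g_i, g_j):
    """Return g_i with its component along g_j removed if the two conflict.

    Gradients conflict when their inner product is negative. In that case
    g_i is projected onto the plane orthogonal to g_j; otherwise it is
    returned unchanged.
    """
    inner = dot(g_i, g_j)
    if inner >= 0.0:
        return list(g_i)
    scale = inner / dot(g_j, g_j)  # g_j is non-zero whenever inner < 0
    return [a - scale * b for a, b in zip(g_i, g_j)]


# Two task gradients that pull in partly opposite directions.
g_leaf = [1.0, 0.0]
g_intermediate = [-1.0, 1.0]
g_leaf_projected = project_conflicting(g_leaf, g_intermediate)
```

+ After projection the two gradients are orthogonal, so a step along one no
+longer directly undoes progress on the other.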
+
+
+
+
+
+
+
+
+ Hendrik Poulsen Nautrup, Hans J. Briegel
+
+
+ Measurement-based quantum computation (MBQC) is a paradigm for quantum
+computation where computation is driven by local measurements on a suitably
+entangled resource state. In this work we show that MBQC is related to a model
+of quantum computation based on Clifford quantum cellular automata (CQCA).
+Specifically, we show that certain MBQCs can be directly constructed from CQCAs
+which yields a simple and intuitive circuit model representation of MBQC in
+terms of quantum computation based on CQCA. We apply this description to
+construct various MBQC-based Ansätze for parameterized quantum circuits,
+demonstrating that the different Ansätze may lead to significantly different
+performances on different learning tasks. In this way, MBQC yields a family of
+hardware-efficient Ansätze that may be adapted to specific problem settings
+and is particularly well suited for architectures with translationally
+invariant gates such as neutral atoms.
+
+
+
+ comment: 16 pages, 12 figures
+
+
+
+
+
+
+ ♻ ☆ ARC: A Generalist Graph Anomaly Detector with In-Context Learning
+
+
+ Graph anomaly detection (GAD), which aims to identify abnormal nodes that
+differ from the majority within a graph, has garnered significant attention.
+However, current GAD methods necessitate training specific to each dataset,
+resulting in high training costs, substantial data requirements, and limited
+generalizability when being applied to new datasets and domains. To address
+these limitations, this paper proposes ARC, a generalist GAD approach that
+enables a "one-for-all" GAD model to detect anomalies across various graph
+datasets on-the-fly. Equipped with in-context learning, ARC can directly
+extract dataset-specific patterns from the target dataset using few-shot normal
+samples at the inference stage, without the need for retraining or fine-tuning
+on the target dataset. ARC comprises three components that are well-crafted for
+capturing universal graph anomaly patterns: 1) smoothness-based feature
+Alignment module that unifies the features of different datasets into a common
+and anomaly-sensitive space; 2) ego-neighbor Residual graph encoder that learns
+abnormality-related node embeddings; and 3) cross-attentive in-Context anomaly
+scoring module that predicts node abnormality by leveraging few-shot normal
+samples. Extensive experiments on multiple benchmark datasets from various
+domains demonstrate the superior anomaly detection performance, efficiency, and
+generalizability of ARC.
+
+
+
+ comment: 25 pages, 10 figures
+
+
+
+
+
+
+ ♻ ☆ Hierarchical Classification Auxiliary Network for Time Series
+ Forecasting
+
+
+ Deep learning has significantly advanced time series forecasting through its
+powerful capacity to capture sequence relationships. However, training these
+models with the Mean Square Error (MSE) loss often results in over-smooth
+predictions, making it challenging to handle the complexity and learn
+high-entropy features from time series data with high variability and
+unpredictability. In this work, we introduce a novel approach by tokenizing
+time series values to train forecasting models via cross-entropy loss, while
+considering the continuous nature of time series data. Specifically, we propose
+a Hierarchical Classification Auxiliary Network, HCAN, a general model-agnostic
+component that can be integrated with any forecasting model. HCAN is based on a
+Hierarchy-Aware Attention module that integrates multi-granularity high-entropy
+features at different hierarchy levels. At each level, we assign a class label
+for timesteps to train an Uncertainty-Aware Classifier. This classifier
+mitigates the over-confidence in softmax loss via evidence theory. We also
+implement a Hierarchical Consistency Loss to maintain prediction consistency
+across hierarchy levels. Extensive experiments integrating HCAN with
+state-of-the-art forecasting models demonstrate substantial improvements over
+baselines on several real-world datasets.
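+
+ To make the idea of tokenizing continuous values at several granularities
+concrete, here is a minimal sketch. It assumes uniform binning over a known
+value range, which the abstract does not specify; the function names and the
+bin counts are hypothetical and chosen only for illustration.

```python
def bucketize(value, low, high, n_bins):
    """Map a continuous value to a class label in [0, n_bins - 1]."""
    if high <= low or n_bins < 1:
        raise ValueError("need high > low and n_bins >= 1")
    fraction = (value - low) / (high - low)
    index = int(fraction * n_bins)
    # Clamp so out-of-range values fall into the first or last bin.
    return min(max(index, 0), n_bins - 1)


def hierarchical_labels(value, low, high, bins_per_level):
    """One class label per hierarchy level, from coarse to fine."""
    return [bucketize(value, low, high, n) for n in bins_per_level]


# Two levels: a coarse 2-way split and a finer 4-way split of [0, 1].
labels_mid = hierarchical_labels(0.5, 0.0, 1.0, [2, 4])
labels_low = hierarchical_labels(0.3, 0.0, 1.0, [2, 4])
```

+ With nested bin counts such as 2 and 4, each fine label falls inside exactly
+one coarse label, which is the kind of agreement a consistency loss across
+levels would encourage.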
+
+
+
+
+
+
+
+ ♻ ☆ Exploring Facets of Language Generation in the Limit
+
+
+ The recent work of Kleinberg & Mullainathan [KM24] provides a concrete model
+for language generation in the limit: given a sequence of examples from an
+unknown target language, the goal is to generate new examples from the target
+language such that no incorrect examples are generated beyond some point. In
+sharp contrast to strong negative results for the closely related problem of
+language identification, they establish positive results for language
+generation in the limit for all countable collections of languages. Follow-up
+work by Raman & Tewari [RT24] studies bounds on the number of distinct inputs
+required by an algorithm before correct language generation is achieved --
+namely, whether this is a constant for all languages in the collection (uniform
+generation) or a language-dependent constant (non-uniform generation).
+ We show that every countable language collection has a generator which has
+the stronger property of non-uniform generation in the limit. However, while
+the generation algorithm of [KM24] can be implemented using membership queries,
+we show that no algorithm can non-uniformly generate, even for collections
+of just two languages, using only membership queries.
+ We also formalize the tension between validity and breadth in the generation
+algorithm of [KM24] by introducing a definition of exhaustive generation, and
+show a strong negative result for exhaustive generation. Our result shows that
+a tradeoff between validity and breadth is inherent for generation in the
+limit. We also provide a precise characterization of the language collections
+for which exhaustive generation is possible. Finally, inspired by algorithms
+that can choose to obtain feedback, we consider a model of uniform generation
+with feedback, completely characterizing language collections for which such
+uniform generation with feedback is possible in terms of a complexity measure
+of the collection.
+
+
+
+ comment: 31 pages. Fixed typos, updated related work, added results on
+ characterization of exhaustive generation
+
+
+
+
+
+
+ ♻ ☆ Level Up with ML Vulnerability Identification: Leveraging Domain
+ Constraints in Feature Space for Robust Android Malware Detection
+
+
+ Machine Learning (ML) promises to enhance the efficacy of Android Malware
+Detection (AMD); however, ML models are vulnerable to realistic evasion
+attacks--crafting realizable Adversarial Examples (AEs) that satisfy Android
+malware domain constraints. To eliminate ML vulnerabilities, defenders aim to
+identify susceptible regions in the feature space where ML models are prone to
+deception. The primary approach to identifying vulnerable regions involves
+investigating realizable AEs, but generating these feasible apps poses a
+challenge. For instance, previous work has relied on generating either
+feature-space norm-bounded AEs or problem-space realizable AEs in adversarial
+hardening. The former is efficient but lacks full coverage of vulnerable
+regions while the latter can uncover these regions by satisfying domain
+constraints but is known to be time-consuming. To address these limitations, we
+propose an approach to facilitate the identification of vulnerable regions.
+Specifically, we introduce a new interpretation of Android domain constraints
+in the feature space, followed by a novel technique that learns them. Our
+empirical evaluations across various evasion attacks indicate effective
+detection of AEs using learned domain constraints, with an average of 89.6%.
+Furthermore, extensive experiments on different Android malware detectors
+demonstrate that utilizing our learned domain constraints in Adversarial
+Training (AT) outperforms other AT-based defenses that rely on norm-bounded AEs
+or state-of-the-art non-uniform perturbations. Finally, we show that retraining
+a malware detector with a wide variety of feature-space realizable AEs results
+in a 77.9% robustness improvement against realizable AEs generated by unknown
+problem-space transformations, with up to 70x faster training than using
+problem-space realizable AEs.
+
+
+
+ comment: The paper was accepted by ACM Transactions on Privacy and Security on
+ 2 December 2024
+
+
+
+
+
+
+ ♻ ☆ Applications of Scientific Machine Learning for the Analysis of
+ Functionally Graded Porous Beams
+
+
+ This study investigates different Scientific Machine Learning (SciML)
+approaches for the analysis of functionally graded (FG) porous beams and
+compares them under a new framework. The beam material properties are assumed
+to vary as an arbitrary continuous function. The methods consider the output of
+a neural network/operator as an approximation to the displacement fields and
+derive the equations governing beam behavior based on the continuum
+formulation. The methods are implemented in the framework and formulated by
+three approaches: (a) the vector approach leads to a Physics-Informed Neural
+Network (PINN), (b) the energy approach brings about the Deep Energy Method
+(DEM), and (c) the data-driven approach, which results in a class of Neural
+Operator methods. Finally, a neural operator has been trained to predict the
+response of the porous beam with functionally graded material under any
+porosity distribution pattern and any arbitrary traction condition. The results
+are validated with analytical and numerical reference solutions. The data and
+code accompanying this manuscript will be publicly available at
+https://github.com/eshaghi-ms/DeepNetBeam.
+
+
+ Bi-level optimization (BO) has become a fundamental mathematical framework
+for addressing hierarchical machine learning problems. As deep learning models
+continue to grow in size, the demand for scalable bi-level optimization
+solutions has become increasingly critical. Traditional gradient-based bi-level
+optimization algorithms, due to their inherent characteristics, are ill-suited
+to meet the demands of large-scale applications. In this paper, we introduce
+$\textbf{F}$orward $\textbf{G}$radient $\textbf{U}$nrolling with
+$\textbf{F}$orward $\textbf{G}$radient, abbreviated as
+$(\textbf{FG})^2\textbf{U}$, which achieves an unbiased stochastic
+approximation of the meta gradient for bi-level optimization.
+$(\text{FG})^2\text{U}$ circumvents the memory and approximation issues
+associated with classical bi-level optimization approaches, and delivers
+significantly more accurate gradient estimates than existing large-scale
+bi-level optimization approaches. Additionally, $(\text{FG})^2\text{U}$ is
+inherently designed to support parallel computing, enabling it to effectively
+leverage large-scale distributed computing systems to achieve significant
+computational efficiency. In practice, $(\text{FG})^2\text{U}$ and other
+methods can be strategically placed at different stages of the training process
+to achieve a more cost-effective two-phase paradigm. Further,
+$(\text{FG})^2\text{U}$ is easy to implement within popular deep learning
+frameworks, and can be conveniently adapted to address more challenging
+zeroth-order bi-level optimization scenarios. We provide a thorough convergence
+analysis and a comprehensive practical discussion for $(\text{FG})^2\text{U}$,
+complemented by extensive empirical evaluations, showcasing its superior
+performance in diverse large-scale bi-level optimization tasks. Code is
+available at https://github.com/ShenQianli/FG2U.
+
+
+
+
+
+
+
+
+ Trung Trinh, Markus Heinonen, Luigi Acerbi, Samuel Kaski
+
+
+ Deep neural networks (DNNs) excel on clean images but struggle with corrupted
+ones. Incorporating specific corruptions into the data augmentation pipeline
+can improve robustness to those corruptions but may harm performance on clean
+images and other types of distortion. In this paper, we introduce an
+alternative approach that improves the robustness of DNNs to a wide range of
+corruptions without compromising accuracy on clean images. We first demonstrate
+that input perturbations can be mimicked by multiplicative perturbations in the
+weight space. Leveraging this, we propose Data Augmentation via Multiplicative
+Perturbation (DAMP), a training method that optimizes DNNs under random
+multiplicative weight perturbations. We also examine the recently proposed
+Adaptive Sharpness-Aware Minimization (ASAM) and show that it optimizes DNNs
+under adversarial multiplicative weight perturbations. Experiments on image
+classification datasets (CIFAR-10/100, TinyImageNet and ImageNet) and neural
+network architectures (ResNet50, ViT-S/16, ViT-B/16) show that DAMP enhances
+model generalization performance in the presence of corruptions across
+different settings. Notably, DAMP is able to train a ViT-S/16 on ImageNet from
+scratch, reaching the top-1 error of 23.7% which is comparable to ResNet50
+without extensive data augmentations.
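+
+ The claim that input perturbations can be mimicked in weight space is easy to
+check for a single linear unit, as in the sketch below. It assumes
+multiplicative factors drawn from a Gaussian with mean 1, which is an
+assumption of this example rather than something stated in the abstract, and
+all function names are hypothetical. It does not reproduce the DAMP training
+procedure itself.

```python
import random


def sample_factors(n, sigma, rng):
    """Draw n multiplicative factors centred on 1 (assumed Gaussian)."""
    return [rng.gauss(1.0, sigma) for _ in range(n)]


def perturb(values, factors):
    """Elementwise multiplicative perturbation."""
    return [v * f for v, f in zip(values, factors)]


def linear_unit(weights, inputs):
    """Pre-activation of one linear unit: the weighted sum of its inputs."""
    return sum(w * x for w, x in zip(weights, inputs))


rng = random.Random(0)
weights = [0.5, -1.0, 2.0]
inputs = [1.0, 3.0, -2.0]
factors = sample_factors(len(weights), 0.2, rng)

# Scaling the inputs and scaling the weights give the same pre-activation.
out_weight_space = linear_unit(perturb(weights, factors), inputs)
out_input_space = linear_unit(weights, perturb(inputs, factors))
```

+ The two outputs agree up to floating-point rounding because each term is the
+product of the same three numbers in a different grouping.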
+
+
+
+ comment: Published at NeurIPS 2024 (spotlight). Code is available at
+ https://github.com/trungtrinh44/DAMP
+
+
+
+
+
+
+ ♻ ☆ DelGrad: Exact event-based gradients in spiking networks for training
+ delays and weights
+
+
+
+
+
+
+
+
+ Julian Göltz, Jimmy Weber, Laura Kriener, Sebastian Billaudelle, Peter Lake, Johannes Schemmel, Melika Payvand, Mihai A. Petrovici
+
+
+ Spiking neural networks (SNNs) inherently rely on the timing of signals for
+representing and processing information. Incorporating trainable transmission
+delays, alongside synaptic weights, is crucial for shaping these temporal
+dynamics. While recent methods have shown the benefits of training delays and
+weights in terms of accuracy and memory efficiency, they rely on discrete time,
+approximate gradients, and full access to internal variables like membrane
+potentials. This limits their precision, efficiency, and suitability for
+neuromorphic hardware due to increased memory requirements and I/O bandwidth
+demands. To address these challenges, we propose DelGrad, an analytical,
+event-based method to compute exact loss gradients for both synaptic weights
+and delays. The inclusion of delays in the training process emerges naturally
+within our proposed formalism, enriching the model's search space with a
+temporal dimension. Moreover, DelGrad, grounded purely in spike timing,
+eliminates the need to track additional variables such as membrane potentials.
+To showcase this key advantage, we demonstrate the functionality and benefits
+of DelGrad on the BrainScaleS-2 neuromorphic platform, by training SNNs in a
+chip-in-the-loop fashion. For the first time, we experimentally demonstrate the
+memory efficiency and accuracy benefits of adding delays to SNNs on noisy
+mixed-signal hardware. These experiments also reveal the
+potential of delays for stabilizing networks against noise. DelGrad opens a new
+way of training SNNs with delays on neuromorphic hardware, which results in
+fewer required parameters, higher accuracy and easier hardware
+training.
+
+
+
+ comment: 22 pages, 11 figures
+
+
+
+
+
+
+ ♻ ☆ Zero-Shot Conditioning of Score-Based Diffusion Models by Neuro-Symbolic
+ Constraints
+
+
+ Score-based diffusion models have emerged as effective approaches for both
+conditional and unconditional generation. Still, conditional generation is based
+on either a specific training of a conditional model or classifier guidance,
+which requires training a noise-dependent classifier, even when a classifier
+for uncorrupted data is given. We propose a method that, given a pre-trained
+unconditional score-based generative model, samples from the conditional
+distribution under arbitrary logical constraints, without requiring additional
+training. Unlike other zero-shot techniques, which aim at
+generating valid conditional samples, our method is designed for approximating
+the true conditional distribution. Firstly, we show how to manipulate the
+learned score in order to sample from an un-normalized distribution conditional
+on a user-defined constraint. Then, we define a flexible and numerically stable
+neuro-symbolic framework for encoding soft logical constraints. Combining these
+two ingredients we obtain a general, but approximate, conditional sampling
+algorithm. We further develop effective heuristics aimed at improving the
+approximation. Finally, we show the effectiveness of our approach in
+approximating conditional distributions for various types of constraints and
+data: tabular data, images and time series.
+
+
+
+
+
+
+
+ ♻ ☆ Go With the Flow: Fast Diffusion for Gaussian Mixture Models
+
+
+
+
+
+
+
+
+ George Rapakoulias, Ali Reza Pedram, Panagiotis Tsiotras
+
+
+ Schrödinger Bridges (SB) are diffusion processes that steer, in finite
+time, a given initial distribution to another final one while minimizing a
+suitable cost functional. Although various methods for computing SBs have
+recently been proposed in the literature, most of these approaches require
+computationally expensive training schemes, even for solving low-dimensional
+problems. In this work, we propose an analytic parametrization of a set of
+feasible policies for steering the distribution of a dynamical system from one
+Gaussian Mixture Model (GMM) to another. Instead of relying on standard
+non-convex optimization techniques, the optimal policy within the set can be
+approximated as the solution of a low-dimensional linear program whose
+dimension scales linearly with the number of components in each mixture.
+Furthermore, our method generalizes naturally to more general classes of
+dynamical systems such as controllable Linear Time-Varying systems that cannot
+currently be solved using traditional neural SB approaches. We showcase the
+potential of this approach in low-to-moderate dimensional problems such as
+image-to-image translation in the latent space of an autoencoder, and various
+other examples. We also benchmark our approach on an Entropic Optimal Transport
+(EOT) problem and show that it outperforms state-of-the-art methods in cases
+where the boundary distributions are mixture models while requiring virtually
+no training.
+
+
+
+
+
+
+
+
+ Tina Dorosti, Manuel Schultheiss, Felix Hofmann, Johannes Thalhammer, Luisa Kirchner, Theresa Urban, Franz Pfeiffer, Florian Schaff, Tobias Lasser, Daniela Pfeiffer
+
+
+ We aim to optimize the binary detection of Chronic Obstructive Pulmonary
+Disease (COPD) based on emphysema presence in the lung with convolutional
+neural networks (CNN) by exploring manually adjusted versus automated
+window-setting optimization (WSO) on computed tomography (CT) images. 7,194 CT
+images (3,597 with COPD; 3,597 healthy controls) from 78 subjects were selected
+retrospectively (10.2018-12.2021) and preprocessed. For each image, intensity
+values were manually clipped to the emphysema window setting and a baseline
+'full-range' window setting. Class-balanced train, validation, and test sets
+contained 3,392, 1,114, and 2,688 images. The network backbone was optimized by
+comparing various CNN architectures. Furthermore, automated WSO was implemented
+by adding a customized layer to the model. The image-level area under the
+Receiver Operating Characteristics curve (AUC) [lower, upper limit 95%
+confidence] was utilized to compare model variations. Repeated inference (n=7)
+on the test set showed that the DenseNet was the most efficient backbone and
+achieved a mean AUC of 0.80 [0.76, 0.85] without WSO. In comparison, with input
+images manually adjusted to the emphysema window, the DenseNet model predicted
+COPD with a mean AUC of 0.86 [0.82, 0.89]. By adding a customized WSO layer to
+the DenseNet, an optimal window in the proximity of the emphysema window
+setting was learned automatically, and a mean AUC of 0.82 [0.78, 0.86] was
+achieved. Detection of COPD with DenseNet models was improved by WSO of CT data
+to the emphysema window setting range.
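+
+ The manual window setting described above amounts to clipping each intensity
+to a fixed range. The sketch below illustrates this; the rescaling to [0, 1]
+after clipping and the window bounds used in the example are assumptions for
+illustration only and are not the values used in the study.

```python
def apply_window(pixels, lower, upper):
    """Clip intensities to [lower, upper] and rescale them to [0, 1]."""
    if upper <= lower:
        raise ValueError("upper bound must exceed lower bound")
    span = upper - lower
    return [(min(max(p, lower), upper) - lower) / span for p in pixels]


# Hypothetical window bounds; values outside the window saturate at 0 or 1.
intensities = [-1200.0, -900.0, -500.0]
windowed = apply_window(intensities, lower=-1000.0, upper=-800.0)
```

+ An automated WSO layer would learn the two bounds instead of fixing them by
+hand.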
+
+
+
+
+
+
+
+ ♻ ☆ Towards An Unsupervised Learning Scheme for Efficiently Solving
+ Parameterized Mixed-Integer Programs
+
+
+ In this paper, we describe a novel unsupervised learning scheme for
+accelerating the solution of a family of mixed integer programming (MIP)
+problems. Substantially different from existing learning-to-optimize methods,
+our proposal seeks to train an autoencoder (AE) for binary variables in an
+unsupervised learning fashion, using data of optimal solutions to historical
+instances for a parametric family of MIPs. By a deliberate design of AE
+architecture and exploitation of its statistical implication, we present a
+simple and straightforward strategy to construct a class of cutting plane
+constraints from the decoder parameters of an offline-trained AE. These
+constraints reliably enclose the optimal binary solutions of new problem
+instances thanks to the representation strength of the AE. More importantly,
+their integration into the primal MIP problem leads to a tightened MIP with a
+reduced feasible region, which can be solved at decision time using
+off-the-shelf solvers with much higher efficiency. Our method is applied to a
+benchmark batch process scheduling problem formulated as a mixed integer linear
+programming (MILP) problem. Comprehensive results demonstrate that our approach
+significantly reduces the computational cost of off-the-shelf MILP solvers
+while retaining high solution quality. The code for this work is
+open-sourced at https://github.com/qushiyuan/AE4BV.
+
+
+
+
+
+
+
+ ♻ ☆ Re-examining learning linear functions in context
+
+
+ In-context learning (ICL) has emerged as a powerful paradigm for easily
+adapting Large Language Models (LLMs) to various tasks. However, our
+understanding of how ICL works remains limited. We explore a simple model of
+ICL in a controlled setup with synthetic training data to investigate ICL of
+univariate linear functions. We experiment with a range of GPT-2-like
+transformer models trained from scratch. Our findings challenge the prevailing
+narrative that transformers adopt algorithmic approaches like linear regression
+to learn a linear function in-context. These models fail to generalize beyond
+their training distribution, highlighting fundamental limitations in their
+capacity to infer abstract task structures. Our experiments lead us to propose
+a mathematically precise hypothesis of what the model might be learning.
+
+
+ Deep Neural Networks (DNNs) have revolutionized artificial intelligence,
+achieving impressive results on diverse data types, including images, videos,
+and texts. However, DNNs still lag behind Gradient Boosting Decision Trees
+(GBDT) on tabular data, a format extensively utilized across various domains.
+In this paper, we propose DOFEN, short for Deep Oblivious Forest ENsemble, a
+novel DNN architecture inspired by oblivious decision trees. DOFEN constructs
+relaxed oblivious decision trees (rODTs) by randomly combining conditions for
+each column and further enhances performance with a two-level rODT forest
+ensembling process. By employing this approach, DOFEN achieves state-of-the-art
+results among DNNs and further narrows the gap between DNNs and tree-based
+models on the well-recognized Tabular Benchmark (Grinsztajn et al., 2022),
+which includes 73 total datasets spanning a wide array of domains. The code of
+DOFEN is available at
+https://github.com/Sinopac-Digital-Technology-Division/DOFEN.
+
+
+
+ comment: NeurIPS 2024 (poster); (v2: modify and rearrange sections, propose
+ multihead extension of DOFEN, include new results on tabular benchmark and
+ other benchmarks)
+
+
+
+
+
+
+ ♻ ☆ Perfect Alignment May be Poisonous to Graph Contrastive Learning ICML 24
+
+
+ Graph Contrastive Learning (GCL) aims to learn node representations by
+aligning positive pairs and separating negative ones. However, few
+researchers have focused on the principles behind the specific augmentations
+used in graph-based learning. What kind of augmentation helps downstream
+performance, how does contrastive learning actually influence downstream tasks,
+and why does the magnitude of augmentation matter so much? This paper seeks to
+address these questions by establishing a connection between augmentation and
+downstream performance. Our findings reveal that GCL contributes to downstream
+tasks mainly by separating different classes rather than gathering nodes of the
+same class. Perfect alignment and augmentation overlap, which make all
+intra-class samples identical, therefore cannot fully explain the success of
+contrastive learning. To understand how augmentation aids the contrastive
+learning process, we further investigate generalization and find that perfect
+alignment, which makes each positive pair identical, can help the contrastive
+loss but is poisonous to generalization. As a result, perfect alignment may not
+lead to the best downstream performance, and specifically designed augmentation
+is needed to achieve appropriate alignment and improve downstream accuracy. We
+further analyse the result using information theory and graph spectrum theory
+and propose two simple but effective methods to verify the theory. The two
+methods can be easily applied to various GCL algorithms, and extensive
+experiments are conducted to demonstrate their effectiveness.
+The code is available at https://github.com/somebodyhh1/GRACEIS
+
+
+
+ comment: ICML 24
+
+
+
+
+
+
+ ♻ ☆ Fast and Interpretable Mortality Risk Scores for Critical Care Patients
+
+
+ Prediction of mortality in intensive care unit (ICU) patients typically
+relies on black box models (that are unacceptable for use in hospitals) or
+hand-tuned interpretable models (that might lead to a loss in performance).
+We aim to bridge the gap between these two categories by building on modern
+interpretable ML techniques to design interpretable mortality risk scores that
+are as accurate as black boxes. We developed a new algorithm, GroupFasterRisk,
+which has several important benefits: it uses both hard and soft direct
+sparsity regularization, it incorporates group sparsity to allow more cohesive
+models, it allows monotonicity constraints to include domain knowledge, and
+it produces many equally-good models, which allows domain experts to choose
+among them. For evaluation, we leveraged the largest existing public ICU
+monitoring datasets (MIMIC III and eICU). Models produced by GroupFasterRisk
+outperformed OASIS and SAPS II scores and performed similarly to APACHE IV/IVa
+while using at most a third of the parameters. For patients with
+sepsis/septicemia, acute myocardial infarction, heart failure, and acute kidney
+failure, GroupFasterRisk models outperformed OASIS and SOFA. Finally, different
+mortality prediction ML approaches performed better based on variables selected
+by GroupFasterRisk as compared to OASIS variables. GroupFasterRisk's models
+performed better than risk scores currently used in hospitals, and on par with
+black box ML models, while being orders of magnitude sparser. Because
+GroupFasterRisk produces a variety of risk scores, it allows design flexibility
+- the key enabler of practical model creation. GroupFasterRisk is a fast,
+accessible, and flexible procedure that allows learning a diverse set of sparse
+risk scores for mortality prediction.
+
+
+
+ comment: This article has been accepted for publication in the Journal of the
+ American Medical Informatics Association, published by Oxford University
+ Press
+
+ A singing voice synthesis (SVS) system is expected to generate high-fidelity
+singing voice from given music scores (lyrics, duration and pitch). Recently,
+diffusion models have performed well in this field. However, trading
+inference speed for high-quality sample generation limits their
+application scenarios. To obtain high-quality synthetic singing voice
+more efficiently, we propose a singing voice synthesis method based on the
+consistency model, ConSinger, to achieve high-fidelity singing voice synthesis
+with minimal steps. The model is trained by applying a consistency constraint, and
+the generation quality is greatly improved at the expense of a small amount of
+inference speed. Our experiments show that ConSinger is highly competitive with
+the baseline model in terms of generation speed and quality. Audio samples are
+available at https://keylxiao.github.io/consinger.
+
+
+ The uses of machine learning (ML) have snowballed in recent years. In many
+cases, ML models are highly complex, and their operation is beyond the
+understanding of human decision-makers. Nevertheless, some uses of ML models
+involve high-stakes and safety-critical applications. Explainable artificial
+intelligence (XAI) aims to help human decision-makers in understanding the
+operation of such complex ML models, thus eliciting trust in their operation.
+Unfortunately, the majority of past XAI work is based on informal approaches
+that offer no guarantees of rigor. Unsurprisingly, there exists comprehensive
+experimental and theoretical evidence confirming that informal methods of XAI
+can provide human decision-makers with erroneous information. Logic-based XAI
+represents a rigorous approach to explainability; it is model-based and offers
+the strongest guarantees of rigor of computed explanations. However, a
+well-known drawback of logic-based XAI is the complexity of logic reasoning,
+especially for highly complex ML models. Recent work proposed
+distance-restricted explanations, i.e. explanations that are rigorous provided
+the distance to a given input is small enough. Distance-restricted
+explainability is tightly related to adversarial robustness, and it has been
+shown to scale for moderately complex ML models, but the number of inputs still
+represents a key limiting factor. This paper investigates novel algorithms for
+scaling up the performance of logic-based explainers when computing and
+enumerating ML model explanations with a large number of inputs.
+
+
+
+
+
+
+
+
+ Badr Moufad, Yazid Janati, Lisa Bedin, Alain Durmus, Randal Douc, Eric Moulines, Jimmy Olsson
+
+
+ Diffusion models have recently shown considerable potential in solving
+Bayesian inverse problems when used as priors. However, sampling from the
+resulting denoising posterior distributions remains a challenge as it involves
+intractable terms. To tackle this issue, state-of-the-art approaches formulate
+the problem as that of sampling from a surrogate diffusion model targeting the
+posterior and decompose its scores into two terms: the prior score and an
+intractable guidance term. While the former is replaced by the pre-trained
+score of the considered diffusion model, the guidance term has to be estimated.
+In this paper, we propose a novel approach that utilises a decomposition of the
+transitions which, in contrast to previous methods, allows a trade-off between
+the complexity of the intractable guidance term and that of the prior
+transitions. We validate the proposed approach through extensive experiments on
+linear and nonlinear inverse problems, including challenging cases with latent
+diffusion models as priors. We then demonstrate its applicability to various
+modalities and its promising impact on public health by tackling cardiovascular
+disease diagnosis through the reconstruction of incomplete electrocardiograms.
+The code is publicly available at https://github.com/yazidjanati/mgps.
+
+
+
+
+
+
+
+ ♻ ☆ On the loss of context-awareness in general instruction fine-tuning
+
+
+ Pre-trained Large Language Models (LLMs) require post-training methods such
+as supervised fine-tuning (SFT) on instruction-response pairs to enable
+instruction following. However, this process can potentially harm existing
+capabilities learned during pre-training. In this paper, we investigate the
+loss of context awareness after SFT, where context awareness is defined as the
+ability to extract and understand information from user-provided context and
+respond accordingly. We are the first to identify and show that the loss of
+context awareness, as reflected by the performance drop in the
+Needle-in-a-Haystack test, occurs in instruction fine-tuned LLMs when the chat
+template is applied to input prompts. We identify that the performance decline
+is partially caused by an attention bias toward different roles learned during
+conversational instruction fine-tuning. We validate our hypothesis by
+visualizing changes in attention allocation after the chat template is applied
+and manually steering the attention heads. Based on these observations, we
+propose a metric to select context-dependent examples from general instruction
+fine-tuning datasets. We then apply conditional instruction fine-tuning with a
+context-dependency indicator, enabling the model to learn context awareness
+from these selected examples. Empirical experiments on four context-dependent
+downstream tasks and three pre-trained LLMs of different sizes show that our
+method effectively mitigates the loss of context awareness without compromising
+general instruction-following capabilities. Given our findings, we strongly
+advocate for careful benchmarking of context awareness after instruction
+fine-tuning.
+
+
+
+
+
+
+
+ ♻ ☆ Integrating Random Effects in Variational Autoencoders for
+ Dimensionality Reduction of Correlated Data
+
+
+ Variational Autoencoders (VAE) are widely used for dimensionality reduction
+of large-scale tabular and image datasets, under the assumption of independence
+between data observations. In practice, however, datasets are often correlated,
+with typical sources of correlation including spatial, temporal and clustering
+structures. Inspired by the literature on linear mixed models (LMM), we propose
+LMMVAE -- a novel model which separates the classic VAE latent model into fixed
+and random parts. While the fixed part assumes the latent variables are
+independent as usual, the random part consists of latent variables which are
+correlated between similar clusters in the data such as nearby locations or
+successive measurements. The classic VAE architecture and loss are modified
+accordingly. LMMVAE is shown to improve squared reconstruction error and
+negative likelihood loss significantly on unseen data, with simulated as well
+as real datasets from various applications and correlation scenarios. It also
+shows improvement in the performance of downstream tasks such as supervised
+classification on the learned representations.
+
+
+
+ comment: 30 pages, 5 figures
+
+
+
+
+
+
+ ♻ ☆ An Empirical Study: Extensive Deep Temporal Point Process
+
+
+ The temporal point process, a stochastic process on the continuous domain of
+time, is commonly used to model asynchronous event sequences with occurrence
+timestamps. Thanks to their strong expressivity, deep neural
+networks are emerging as a promising choice for capturing the patterns in
+asynchronous sequences in the context of temporal point processes. In this
+paper, we first review recent research emphases and difficulties in modeling
+asynchronous event sequences with deep temporal point processes, which can be
+summarized in four areas: encoding of the history sequence, formulation of the
+conditional intensity function, relational discovery of events, and learning
+approaches for optimization. We introduce most of the recently proposed models by
+dismantling them into the four parts, and conduct experiments by remodularizing
+the first three parts with the same learning strategy for a fair empirical
+evaluation. Besides, we extend the history encoders and conditional intensity
+function family, and propose a Granger causality discovery framework for
+exploiting the relations among multiple types of events. Because the Granger
+causality can be represented by the Granger causality graph, discrete graph
+structure learning in the framework of Variational Inference is employed to
+reveal latent structures of the graph. Further experiments show that the
+proposed framework with latent graph discovery can both capture the relations
+and achieve an improved fitting and predicting performance.
+
+
+
+ comment: 22 pages, 8 figures
+
+
+
+
+
+
+ ♻ ☆ Can Large Language Models Improve the Adversarial Robustness of Graph
+ Neural Networks? KDD 2025
+
+
+ Graph neural networks (GNNs) are vulnerable to adversarial attacks,
+especially for topology perturbations, and many methods that improve the
+robustness of GNNs have received considerable attention. Recently, we have
+witnessed the significant success of large language models (LLMs), leading many
+to explore the great potential of LLMs on GNNs. However, they mainly focus on
+improving the performance of GNNs by utilizing LLMs to enhance the node
+features. Therefore, we ask: Will the robustness of GNNs also be enhanced with
+the powerful understanding and inference capabilities of LLMs? By presenting
+the empirical results, we find that although LLMs can improve the
+robustness of GNNs, there is still an average decrease of 23.1% in accuracy,
+implying that the GNNs remain extremely vulnerable against topology attacks.
+Therefore, another question is how to extend the capabilities of LLMs on graph
+adversarial robustness. In this paper, we propose an LLM-based robust graph
+structure inference framework, LLM4RGNN, which distills the inference
+capabilities of GPT-4 into a local LLM for identifying malicious edges and an
+LM-based edge predictor for finding missing important edges, so as to recover a
+robust graph structure. Extensive experiments demonstrate that LLM4RGNN
+consistently improves the robustness across various GNNs. Even in some cases
+where the perturbation ratio increases to 40%, the accuracy of GNNs is still
+better than that on the clean graph. The source code can be found in
+https://github.com/zhongjian-zhang/LLM4RGNN.
+
+
+
+ comment: accepted by KDD 2025
+
+
+
+
+
+
+ ♻ ☆ The Potential of Convolutional Neural Networks for Cancer Detection
+
+
+ Early detection of cancer is critical in improving treatment outcomes and
+increasing survival rates, particularly for common cancers such as lung,
+breast, and prostate which collectively contribute to a significant global
+mortality burden. With advancements in imaging technologies and data
+processing, Convolutional Neural Networks (CNNs) have emerged as a powerful
+tool for analyzing and classifying medical images, enabling more precise cancer
+detection. This paper provides a comprehensive review of recent studies
+leveraging CNN models for detecting ten different types of cancer. Each study
+employs distinct CNN architectures to identify patterns associated with these
+cancers, utilizing diverse datasets. Key differences and strengths of these
+architectures are meticulously compared and analyzed, highlighting their
+efficacy in improving early detection. Beyond reviewing the performance and
+limitations of CNN-based cancer detection methods, this study explores the
+feasibility of integrating CNNs into clinical settings as an early detection
+tool, potentially complementing or replacing traditional methods. Despite
+significant progress, challenges remain, including data diversity, result
+interpretation, and ethical considerations. By identifying the best-performing
+CNN architectures and providing a comparative analysis, this study aims to
+contribute a comprehensive perspective on the application of CNNs in cancer
+detection and their role in advancing diagnostic capabilities in healthcare.
+
+
+
+
+
+
+
+ ♻ ☆ Locally Convex Global Loss Network for Decision-Focused Learning AAAI-25
+
+
+
+
+
+
+
+
+ Haeun Jeon, Hyunglip Bae, Minsu Park, Chanyeong Kim, Woo Chang Kim
+
+
+ In decision-making problems under uncertainty, predicting unknown parameters
+is often considered independent of the optimization part. Decision-focused
+learning (DFL) is a task-oriented framework that integrates prediction and
+optimization by adapting the predictive model to give better decisions for the
+corresponding task. Here, an inevitable challenge arises when computing the
+gradients of the optimal decision with respect to the parameters. Existing
+research copes with this issue by smoothly reforming surrogate optimization or
+constructing surrogate loss functions that mimic task loss. However, they are
+applied to restricted optimization domains. In this paper, we propose Locally
+Convex Global Loss Network (LCGLN), a global surrogate loss model that can be
+implemented in a general DFL paradigm. LCGLN learns task loss via a partial
+input convex neural network which is guaranteed to be convex for chosen inputs
+while keeping the non-convex global structure for the other inputs. This
+enables LCGLN to admit general DFL through only a single surrogate loss without
+requiring any intuition for choosing appropriate parametric forms. We confirm the
+effectiveness and flexibility of LCGLN by evaluating our proposed model with
+three stochastic decision-making problems.
+
+
+
+ comment: AAAI-25
+
+
+
+
+
+
+ ♻ ☆ Tackling Intertwined Data and Device Heterogeneities in Federated
+ Learning with Unlimited Staleness AAAI 2025
+
+
+ Federated Learning (FL) can be affected by data and device heterogeneities,
+caused by clients' different local data distributions and latencies in
+uploading model updates (i.e., staleness). Traditional schemes consider these
+heterogeneities as two separate and independent aspects, but this assumption is
+unrealistic in practical FL scenarios where these heterogeneities are
+intertwined. In these cases, traditional FL schemes are ineffective, and a
+better approach is to convert a stale model update into an unstale one. In this
+paper, we present a new FL framework that ensures the accuracy and
+computational efficiency of this conversion, hence effectively tackling the
+intertwined heterogeneities that may cause unlimited staleness in model
+updates. Our basic idea is to estimate the distributions of clients' local
+training data from their uploaded stale model updates, and use these
+estimations to compute unstale client model updates. In this way, our approach
+does not require any auxiliary dataset nor the clients' local models to be
+fully trained, and does not incur any additional computation or communication
+overhead at client devices. We compared our approach with the existing FL
+strategies on mainstream datasets and models, and showed that our approach can
+improve the trained model accuracy by up to 25% and reduce the number of
+required training epochs by up to 35%. Source codes can be found at:
+https://github.com/pittisl/FL-with-intertwined-heterogeneity.
+
+
+
+ comment: 22 pages. An abbreviated version is published at AAAI 2025
+
+
+
+
+
+
+ ♻ ☆ Unlocking Global Optimality in Bilevel Optimization: A Pilot Study
+
+
+ Bilevel optimization has witnessed a resurgence of interest, driven by its
+critical role in trustworthy and efficient AI applications. While many recent
+works have established convergence to stationary points or local minima,
+obtaining the global optimum of bilevel optimization remains an important yet
+open problem. The difficulty lies in the fact that, unlike many prior
+non-convex single-level problems, bilevel problems often do not admit a benign
+landscape, and may indeed have multiple spurious local solutions. Nevertheless,
+attaining global optimality is indispensable for ensuring reliability, safety,
+and cost-effectiveness, particularly in high-stakes engineering applications
+that rely on bilevel optimization. In this paper, we first explore the
+challenges of establishing a global convergence theory for bilevel
+optimization, and present two sufficient conditions for global convergence. We
+provide algorithm-dependent proofs to rigorously substantiate these sufficient
+conditions on two specific bilevel learning scenarios: representation learning
+and data hypercleaning (a.k.a. reweighting). Experiments corroborate the
+theoretical findings, demonstrating convergence to the global minimum in both
+cases.
+
+
+
+
+
+
+
+ ♻ ☆ Cross-Attention Graph Neural Networks for Inferring Gene Regulatory
+ Networks with Skewed Degree Distribution
+
+
+ Inferring Gene Regulatory Networks (GRNs) from gene expression data is a
+pivotal challenge in systems biology, and several innovative computational
+methods have been introduced. However, most of these studies have not
+considered the skewed degree distribution of genes. Specifically, some genes
+may regulate multiple target genes while some genes may be regulated by
+multiple regulator genes. Such a skewed degree distribution issue significantly
+complicates the application of directed graph embedding methods. To tackle this
+issue, we propose the Cross-Attention Complex Dual Graph Embedding Model
+(XATGRN). Our XATGRN employs a cross-attention mechanism to effectively capture
+intricate gene interactions from gene expression profiles. Additionally, it
+uses a Dual Complex Graph Embedding approach to manage the skewed degree
+distribution, thereby ensuring precise prediction of regulatory relationships
+and their directionality. Our model consistently outperforms existing
+state-of-the-art methods across various datasets, underscoring its efficacy in
+elucidating complex gene regulatory mechanisms. Our codes used in this paper
+are publicly available at: https://github.com/kikixiong/XATGRN.
+
+
+
+ comment: 11 pages, 6 figures, 1 table
+
+
+
+
+
+
+ ♻ ☆ Tacit Learning with Adaptive Information Selection for Cooperative
+ Multi-Agent Reinforcement Learning AAMAS 2025
+
+
+ In multi-agent reinforcement learning (MARL), the centralized training with
+decentralized execution (CTDE) framework has gained widespread adoption due to
+its strong performance. However, the further development of CTDE faces two key
+challenges. First, agents struggle to autonomously assess the relevance of
+input information for cooperative tasks, impairing their decision-making
+abilities. Second, in communication-limited scenarios with partial
+observability, agents are unable to access global information, restricting
+their ability to collaborate effectively from a global perspective. To address
+these challenges, we introduce a novel cooperative MARL framework based on
+information selection and tacit learning. In this framework, agents gradually
+develop implicit coordination during training, enabling them to infer the
+cooperative behavior of others in a discrete space without communication,
+relying solely on local information. Moreover, we integrate gating and
+selection mechanisms, allowing agents to adaptively filter information based on
+environmental changes, thereby enhancing their decision-making capabilities.
+Experiments on popular MARL benchmarks show that our framework can be
+seamlessly integrated with state-of-the-art algorithms, leading to significant
+performance improvements.
+
+
+
+ comment: Accepted by AAMAS 2025 (Extended Abstract)
+
+
+
+
+
+
+ ♻ ☆ Exploring Parameter-Efficient Fine-Tuning to Enable Foundation Models in
+ Federated Learning
+
+
+
+
+
+
+
+
+ Guangyu Sun, Umar Khalid, Matias Mendieta, Pu Wang, Chen Chen
+
+
+ Federated learning (FL) has emerged as a promising paradigm for enabling the
+collaborative training of models without centralized access to the raw data on
+local devices. In the typical FL paradigm (e.g., FedAvg), model weights are
+sent to and from the server each round to participating clients. Recently, the
+use of small pre-trained models has been shown to be effective in federated
+learning optimization and improving convergence. However, recent
+state-of-the-art pre-trained models are getting more capable but also have more
+parameters, known as the "Foundation Models." In conventional FL, sharing the
+enormous model weights can quickly put a massive communication burden on the
+system, especially if more capable models are employed. Can we find a solution
+to enable those strong and readily available pre-trained models in FL to
+achieve excellent performance while simultaneously reducing the communication
+burden? To this end, we investigate the use of parameter-efficient fine-tuning
+in federated learning and thus introduce a new framework: FedPEFT.
+Specifically, we systematically evaluate the performance of FedPEFT across a
+variety of client stability, data distribution, and differential privacy
+settings. By only locally tuning and globally sharing a small portion of the
+model weights, significant reductions in the total communication overhead can
+be achieved while maintaining competitive or even better performance in a wide
+range of federated learning scenarios, providing insight into a new paradigm
+for practical and effective federated systems.
+
+
+
+ comment: Published in 2024 IEEE International Conference on Big Data
+
+
+
+
+
+
+ ♻ ☆ Sparse-PGD: A Unified Framework for Sparse Adversarial Perturbations
+ Generation
+
+
+ This work studies sparse adversarial perturbations, including both
+unstructured and structured ones. We propose a framework based on a white-box
+PGD-like attack method named Sparse-PGD to effectively and efficiently generate
+such perturbations. Furthermore, we combine Sparse-PGD with a black-box attack
+to comprehensively and more reliably evaluate the models' robustness against
+unstructured and structured sparse adversarial perturbations. Moreover, the
+efficiency of Sparse-PGD enables us to conduct adversarial training to build
+robust models against various sparse perturbations. Extensive experiments
+demonstrate that our proposed attack algorithm exhibits strong performance in
+different scenarios. More importantly, compared with other robust models, our
+adversarially trained model demonstrates state-of-the-art robustness against
+various sparse attacks.
+
+
+
+ comment: Extended version. Codes are available at
+ https://github.com/CityU-MLO/sPGD
+
+
+
+
+
+
+ ♻ ☆ The Road to Artificial SuperIntelligence: A Comprehensive Survey of
+ Superalignment
+
+
+ The emergence of large language models (LLMs) has raised the possibility of
+Artificial Superintelligence (ASI), a hypothetical AI system surpassing
+human intelligence. However, existing alignment paradigms struggle to guide
+such advanced AI systems. Superalignment, the alignment of AI systems with
+human values and safety requirements at superhuman levels of capability, aims to
+address two primary goals -- scalability in supervision to provide
+high-quality guidance signals and robust governance to ensure alignment with
+human values. In this survey, we examine scalable oversight methods and
+potential solutions for superalignment. Specifically, we explore the concept of
+ASI, the challenges it poses, and the limitations of current alignment
+paradigms in addressing the superalignment problem. Then we review scalable
+oversight methods for superalignment. Finally, we discuss the key challenges
+and propose pathways for the safe and continual improvement of ASI systems. By
+comprehensively reviewing the current literature, our goal is to provide a
+systematic introduction to existing methods, analyze their strengths and
+limitations, and discuss potential future directions.
+
+
+
+
+
+
+
+ ♻ ☆ Flow Matching for Optimal Reaction Coordinates of Biomolecular System
+
+
+ We present flow matching for reaction coordinates (FMRC), a novel deep
+learning algorithm designed to identify optimal reaction coordinates (RC) in
+biomolecular reversible dynamics. FMRC is based on the mathematical principles
+of lumpability and decomposability, which we reformulate into a conditional
+probability framework for efficient data-driven optimization using deep
+generative models. While FMRC does not explicitly learn the well-established
+transfer operator or its eigenfunctions, it can effectively encode the dynamics
+of leading eigenfunctions of the system transfer operator into its
+low-dimensional RC space. We further quantitatively compare its performance
+with several state-of-the-art algorithms by evaluating the quality of Markov
+state models (MSM) constructed in their respective RC spaces, demonstrating the
+superiority of FMRC in three increasingly complex biomolecular systems. In
+addition, we successfully demonstrated the efficacy of FMRC for bias deposition
+in the enhanced sampling of a simple model system. Finally, we discuss its
+potential use in downstream applications such as enhanced sampling
+methods and MSM construction.
+
+
+
+ comment: For Supporting Information, please see
+ https://pubs.acs.org/doi/full/10.1021/acs.jctc.4c01139
+
+ In this paper, we introduce Diff-Instruct* (DI*), an image data-free
+approach for building one-step text-to-image generative models that align with
+human preference while maintaining the ability to generate highly realistic
+images. We frame human preference alignment as online reinforcement learning
+using human feedback (RLHF), where the goal is to maximize the reward function
+while regularizing the generator distribution to remain close to a reference
+diffusion process. Unlike traditional RLHF approaches, which rely on the KL
+divergence for regularization, we introduce a novel score-based divergence
+regularization, which leads to significantly better performance. Although the
+direct calculation of this preference alignment objective remains intractable,
+we demonstrate that we can efficiently compute its gradient by deriving an
+equivalent yet tractable loss function. Remarkably, we used Diff-Instruct* to
+train a Stable Diffusion-XL-based 1-step model, the 2.6B DI*-SDXL-1step
+text-to-image model, which can generate images of a resolution of 1024x1024
+with only 1 generation step. The DI*-SDXL-1step model uses only 1.88% inference
+time and 29.30% GPU memory cost to outperform 12B FLUX-dev-50step significantly
+in PickScore, ImageReward, and CLIPScore on Parti prompt benchmark and HPSv2.1
+on Human Preference Score benchmark, establishing a new state-of-the-art
+benchmark of human-preferred 1-step text-to-image generative models. Besides
+the strong quantitative performance, extensive qualitative comparisons also
+confirm the advantages of DI* in terms of maintaining diversity, improving
+image layouts, and enhancing aesthetic colors. We have released our
+industry-ready model on the homepage:
+https://github.com/pkulwj1994/diff_instruct_star.
+
+
+
+ comment: revision: 2.6B 1-step text-to-image model outperforms 12B
+ Flux-dev-50step model in human preferences
+
+
+
+
+
+
+ ♻ ☆ Asymptotic Theory for IV-Based Reinforcement Learning with Potential
+ Endogeneity
+
+
+
+
+
+
+
+
+ Jin Li, Ye Luo, Zigan Wang, Xiaowei Zhang
+
+
+ In the standard data analysis framework, data is collected (once and for
+all), and then data analysis is carried out. However, with the advancement of
+digital technology, decision-makers constantly analyze past data and generate
+new data through their decisions. We model this as a Markov decision process
+and show that the dynamic interaction between data generation and data analysis
+leads to a new type of bias -- reinforcement bias -- that exacerbates the
+endogeneity problem in standard data analysis. We propose a class of instrumental
+variable (IV)-based reinforcement learning (RL) algorithms to correct for the
+bias and establish their theoretical properties by incorporating them into a
+stochastic approximation (SA) framework. Our analysis accommodates
+iterate-dependent Markovian structures and, therefore, can be used to study RL
+algorithms with policy improvement. We also provide formulas for inference on
+optimal policies of the IV-RL algorithms. These formulas highlight how
+intertemporal dependencies of the Markovian environment affect the inference.
+
+
+ An established failure mode for machine learning models occurs when the same
+features are equally likely to belong to class 0 and class 1. In such cases,
+existing ML models cannot correctly classify the sample. However, a solvable
+case emerges when the probabilities of class 0 and 1 vary with the context
+distribution. To the best of our knowledge, standard neural network
+architectures like MLPs or CNNs are not equipped to handle this.
+ In this article, we propose a simple activation function, quantile activation
+(QACT), that addresses this problem without significantly increasing
+computational costs. The core idea is to adapt the outputs of each neuron to
+its context distribution. The proposed quantile activation, QACT, produces the
+relative quantile of the sample in its context distribution, rather than the
+actual values, as in traditional networks.
+ A practical example where the same sample can have different labels arises in
+cases of inherent distribution shift. We validate the proposed activation
+function under such shifts, using datasets designed to test robustness against
+distortions: CIFAR10C, CIFAR100C, MNISTC, TinyImagenetC. Our results
+demonstrate significantly better generalization across distortions compared to
+conventional classifiers, across various architectures. Although this paper
+presents a proof of concept, we find that this approach unexpectedly
+outperforms DINOv2 (small) under large distortions, despite DINOv2 being
+trained with a much larger network and dataset.
+
+
+
+
+
+
+
+ ♻ ☆ Adversarial Score identity Distillation: Rapidly Surpassing the Teacher
+ in One Step
+
+
+
+
+
+
+
+
+ Mingyuan Zhou, Huangjie Zheng, Yi Gu, Zhendong Wang, Hai Huang
+
+
+ Score identity Distillation (SiD) is a data-free method that has achieved
+SOTA performance in image generation by leveraging only a pretrained diffusion
+model, without requiring any training data. However, its ultimate performance
+is constrained by how accurately the pretrained model captures the true data
+scores at different stages of the diffusion process. In this paper, we
+introduce SiDA (SiD with Adversarial Loss), which not only enhances generation
+quality but also improves distillation efficiency by incorporating real images
+and adversarial loss. SiDA utilizes the encoder from the generator's score
+network as a discriminator, allowing it to distinguish between real images and
+those generated by SiD. The adversarial loss is batch-normalized within each
+GPU and then combined with the original SiD loss. This integration effectively
+incorporates the average "fakeness" per GPU batch into the pixel-based SiD
+loss, enabling SiDA to distill a single-step generator. SiDA converges
+significantly faster than its predecessor when distilled from scratch, and
+swiftly improves upon the original model's performance during fine-tuning from
+a pre-distilled SiD generator. This one-step adversarial distillation method
+establishes new benchmarks in generation performance when distilling EDM
+diffusion models, achieving FID scores of 1.110 on ImageNet 64x64. When
+distilling EDM2 models trained on ImageNet 512x512, our SiDA method surpasses
+even the largest teacher model, EDM2-XXL, which achieved an FID of 1.81 using
+classifier-free guidance (CFG) and 63 generation steps. In contrast, SiDA
+achieves FID scores of 2.156 for size XS, 1.669 for S, 1.488 for M, 1.413 for
+L, 1.379 for XL, and 1.366 for XXL, all without CFG and in a single generation
+step. These results highlight substantial improvements across all model sizes.
+Our code is available at https://github.com/mingyuanzhou/SiD/tree/sida.
+
+
+
+
+
+
+
+ ♻ ☆ Explainable AI for Multivariate Time Series Pattern Exploration: Latent
+ Space Visual Analytics with Temporal Fusion Transformer and Variational
+ Autoencoders in Power Grid Event Diagnosis
+
+
+
+
+
+
+
+
+ Haowen Xu, Ali Boyaci, Jianming Lian, Aaron Wilson
+
+
+ Detecting and analyzing complex patterns in multivariate time-series data is
+crucial for decision-making in urban and environmental system operations.
+However, challenges arise from the high dimensionality, intricate complexity,
+and interconnected nature of complex patterns, which hinder the understanding
+of their underlying physical processes. Existing AI methods often face
+limitations in interpretability, computational efficiency, and scalability,
+reducing their applicability in real-world scenarios. This paper proposes a
+novel visual analytics framework that integrates two generative AI models,
+Temporal Fusion Transformer (TFT) and Variational Autoencoders (VAEs), to
+reduce complex patterns into lower-dimensional latent spaces and visualize them
+in 2D using dimensionality reduction techniques such as PCA, t-SNE, and UMAP
+with DBSCAN. These visualizations, presented through coordinated and
+interactive views and tailored glyphs, enable intuitive exploration of complex
+multivariate temporal patterns, identifying similarities among patterns and
+uncovering their potential correlations for better interpretability of the AI outputs.
+The framework is demonstrated through a case study on power grid signal data,
+where it identifies multi-label grid event signatures, including faults and
+anomalies with diverse root causes. Additionally, novel metrics and
+visualizations are introduced to validate the models and evaluate the
+performance, efficiency, and consistency of latent maps generated by TFT and
+VAE under different configurations. These analyses provide actionable insights
+for model parameter tuning and reliability improvements. Comparative results
+highlight that TFT achieves shorter run times and superior scalability to
+diverse time-series data shapes compared to VAE. This work advances fault
+diagnosis in multivariate time series, fostering explainable AI to support
+critical system operations.
+
+
+ Consider a predictor, a learner, whose input is a stream of discrete items.
+The predictor's task, at every time point, is probabilistic multiclass
+prediction, i.e. to predict which item may occur next by outputting zero or
+more candidate items, each with a probability, after which the actual item is
+revealed and the predictor updates. To output probabilities, the predictor
+keeps track of the proportions of the items it has seen. The stream is
+unbounded (lifelong), and the predictor has finite limited space. The task is
+open-ended: the set of items is unknown to the predictor and their totality can
+also grow unbounded. Moreover, there is non-stationarity: the underlying
+frequencies of items may change, substantially, from time to time. For
+instance, new items may start appearing and a few recently frequent items may
+cease to occur again. The predictor, being space-bounded, need only provide
+probabilities for those items which, at the time of prediction, have
+sufficiently high frequency, i.e., the salient items. This problem is motivated
+in the setting of Prediction Games, a self-supervised learning regime where
+concepts serve as both the predictors and the predictands, and the set of
+concepts grows over time, resulting in non-stationarities as new concepts are
+generated and used. We design and study a number of predictors, sparse moving
+averages (SMAs), for the task. One SMA adapts the sparse exponentiated moving
+average and another is based on queuing a few counts, keeping dynamic per-item
+histories. Evaluating the predicted probabilities, under noise and
+non-stationarity, presents challenges, and we discuss and develop evaluation
+methods, one based on bounding log-loss. We show that a combination of ideas,
+supporting dynamic predictand-specific learning rates, offers advantages in
+terms of faster adaption to change (plasticity), while also supporting low
+variance (stability).
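As a rough illustration of the sparse exponentiated moving average idea described above, the following Python sketch keeps a bounded map of item probabilities, decays them on every observation, and drops items that are no longer salient. The class name `SparseEMA`, the fixed `rate`, and the `min_prob` pruning threshold are assumptions made for this sketch. The abstract does not specify them, and the dynamic predictand-specific learning rates it mentions are not modeled here.

```python
class SparseEMA:
    """Sparse exponential moving average over a stream of discrete items.

    Hypothetical sketch: fixed learning rate, threshold-based pruning.
    """

    def __init__(self, rate=0.05, min_prob=0.01):
        self.rate = rate          # learning rate (assumed constant here)
        self.min_prob = min_prob  # items below this estimate are forgotten
        self.weights = {}         # item -> estimated probability

    def predict(self):
        """Return the currently tracked (salient) items with their estimates."""
        return dict(self.weights)

    def update(self, item):
        """Decay every tracked item, boost the observed one, then prune."""
        for key in list(self.weights):
            self.weights[key] *= 1.0 - self.rate
        self.weights[item] = self.weights.get(item, 0.0) + self.rate
        # Bounded space: keep only items whose estimate is still salient.
        self.weights = {
            k: w for k, w in self.weights.items() if w >= self.min_prob
        }
```

With a single fixed rate, plasticity and stability trade off directly: a larger `rate` tracks changes faster but yields noisier estimates, which is the tension the abstract's dynamic rates are meant to ease.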
+
+
+
+ comment: 69 pages, 30 figures, 18 tables
+
+
+
+
+
+
+ ♻ ☆ MacLight: Multi-scene Aggregation Convolutional Learning for Traffic
+ Signal Control AAMAS2025
+
+
+ Reinforcement learning methods have proposed promising traffic signal control
+policies that can be trained on large road networks. Current SOTA methods model
+road networks as topological graph structures, incorporate graph attention into
+deep Q-learning, and merge local and global embeddings to improve policy.
+However, graph-based methods are difficult to parallelize, resulting in huge
+time overhead. Moreover, none of the current peer studies have deployed dynamic
+traffic systems for experiments, which is far from the actual situation.
+ In this context, we propose Multi-Scene Aggregation Convolutional Learning
+for traffic signal control (MacLight), which offers faster training speeds and
+more stable performance. Our approach consists of two main components. The
+first is the global representation, where we utilize variational autoencoders
+to compactly compress and extract the global representation. The second
+component employs the proximal policy optimization algorithm as the backbone,
+allowing value evaluation to consider both local features and global embedding
+representations. This backbone model significantly reduces time overhead and
+ensures stability in policy updates. We validated our method across multiple
+traffic scenarios under both static and dynamic traffic systems. Experimental
+results demonstrate that, compared to general and domain SOTA methods, our
+approach achieves superior stability, optimized convergence levels and the
+highest time efficiency. The code is under
+https://github.com/Aegis1863/MacLight.
+
+
+
+ comment: Accepted as full paper by AAMAS2025
+
+
+
+
+
+
+ ♻ ☆ The Numerical Stability of Hyperbolic Representation Learning
+
+
+
+
+
+
+
+
+ Gal Mishne, Zhengchao Wan, Yusu Wang, Sheng Yang
+
+
+ Given the exponential growth of the volume of the ball w.r.t. its radius, the
+hyperbolic space is capable of embedding trees with arbitrarily small
+distortion and hence has received wide attention for representing hierarchical
+datasets. However, this exponential growth property comes at a price of
+numerical instability such that training hyperbolic learning models will
+sometimes lead to catastrophic NaN problems, encountering unrepresentable
+values in floating point arithmetic. In this work, we carefully analyze the
+limitation of two popular models for the hyperbolic space, namely, the
+Poincaré ball and the Lorentz model. We first show that, under the 64-bit
+arithmetic system, the Poincaré ball has a relatively larger capacity than
+the Lorentz model for correctly representing points. Then, we theoretically
+validate the superiority of the Lorentz model over the Poincaré ball from the
+perspective of optimization. Given the numerical limitations of both models, we
+identify one Euclidean parametrization of the hyperbolic space which can
+alleviate these limitations. We further extend this Euclidean parametrization
+to hyperbolic hyperplanes and exhibit its ability to improve the performance
+of hyperbolic SVM.
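The representation limits mentioned above can be probed with a few lines of 64-bit arithmetic. The sketch below is an illustration under our own assumptions (curvature -1, a one-dimensional slice, and the hypothetical helper names `poincare_radius` and `lorentz_constraint`); it is not the analysis carried out in the paper. It maps a point at hyperbolic distance `r` from the origin into each model and checks whether the result is still a valid point.

```python
import math


def poincare_radius(r):
    """Euclidean norm of a Poincare-ball point at hyperbolic distance r.

    For curvature -1, d(0, x) = 2 * artanh(|x|), hence |x| = tanh(r / 2).
    A value that rounds to exactly 1.0 lies on the boundary, i.e. the point
    is no longer representable inside the ball.
    """
    return math.tanh(r / 2.0)


def lorentz_constraint(r):
    """Evaluate -x0^2 + x1^2 for the Lorentz point (cosh r, sinh r).

    In exact arithmetic this equals -1; a large deviation means the stored
    coordinates no longer lie on the hyperboloid.
    """
    x0, x1 = math.cosh(r), math.sinh(r)
    return -x0 * x0 + x1 * x1
```

Evaluating both helpers over a range of `r` shows where each model stops holding valid points in double precision; in this toy setting the hyperboloid constraint degrades at smaller distances than the Poincaré norm saturates.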
+
+
+ Autoregressive (AR) models have achieved state-of-the-art performance in text
+and image generation but suffer from slow generation due to the token-by-token
+process. We ask an ambitious question: can a pre-trained AR model be adapted to
+generate outputs in just one or two steps? If successful, this would
+significantly advance the development and deployment of AR models. We notice
+that existing works that try to speed up AR generation by generating multiple
+tokens at once fundamentally cannot capture the output distribution due to the
+conditional dependencies between tokens, limiting their effectiveness for
+few-step generation. To address this, we propose Distilled Decoding (DD), which
+uses flow matching to create a deterministic mapping from Gaussian distribution
+to the output distribution of the pre-trained AR model. We then train a network
+to distill this mapping, enabling few-step generation. DD doesn't need the
+training data of the original AR model, making it more practical. We evaluate
+DD on state-of-the-art image AR models and present promising results on
+ImageNet-256. For VAR, which requires 10-step generation, DD enables one-step
+generation (6.3$\times$ speed-up), with an acceptable increase in FID from 4.19
+to 9.96. For LlamaGen, DD reduces generation from 256 steps to 1, achieving a
+217.8$\times$ speed-up with a comparable FID increase from 4.11 to 11.35. In
+both cases, baseline methods completely fail with FID>100. DD also excels on
+text-to-image generation, reducing the generation from 256 steps to 2 for
+LlamaGen with minimal FID increase from 25.70 to 28.95. As the first work to
+demonstrate the possibility of one-step generation for image AR models, DD
+challenges the prevailing notion that AR models are inherently slow, and opens
+up new opportunities for efficient AR generation. The project website is at
+https://imagination-research.github.io/distilled-decoding.
+
+
+
+
+
+
+
+ ♻ ☆ ProCNS: Progressive Prototype Calibration and Noise Suppression for
+ Weakly-Supervised Medical Image Segmentation
+
+
+
+
+
+
+
+
+ Y. Liu, L. Lin, K. K. Y. Wong, X. Tang
+
+
+ Weakly-supervised segmentation (WSS) has emerged as a solution to mitigate
+the conflict between annotation cost and model performance by adopting sparse
+annotation formats (e.g., point, scribble, block, etc.). Typical approaches
+attempt to exploit anatomy and topology priors to directly expand sparse
+annotations into pseudo-labels. However, due to a lack of attention to the
+ambiguous edges in medical images and insufficient exploration of sparse
+supervision, existing approaches tend to generate erroneous and overconfident
+pseudo proposals in noisy regions, leading to cumulative model error and
+performance degradation. In this work, we propose a novel WSS approach, named
+ProCNS, encompassing two synergistic modules devised with the principles of
+progressive prototype calibration and noise suppression. Specifically, we
+design a Prototype-based Regional Spatial Affinity (PRSA) loss to maximize the
+pair-wise affinities between spatial and semantic elements, providing our model
+of interest with more reliable guidance. The affinities are derived from the
+input images and the prototype-refined predictions. Meanwhile, we propose an
+Adaptive Noise Perception and Masking (ANPM) module to obtain more enriched and
+representative prototype representations, which adaptively identifies and masks
+noisy regions within the pseudo proposals, reducing potential erroneous
+interference during prototype computation. Furthermore, we generate specialized
+soft pseudo-labels for the noisy regions identified by ANPM, providing
+supplementary supervision. Extensive experiments on six medical image
+segmentation tasks involving different modalities demonstrate that the proposed
+framework significantly outperforms representative state-of-the-art methods.
+
+
+
+
+
+
+
+
+ Xiaoyi Cai, James Queeney, Tong Xu, Aniket Datar, Chenhui Pan, Max Miller, Ashton Flather, Philip R. Osteen, Nicholas Roy, Xuesu Xiao, Jonathan P. How
+
+
+ Self-supervised learning is a powerful approach for developing traversability
+models for off-road navigation, but these models often struggle with inputs
+unseen during training. Existing methods utilize techniques like evidential
+deep learning to quantify model uncertainty, helping to identify and avoid
+out-of-distribution terrain. However, always avoiding out-of-distribution
+terrain can be overly conservative, e.g., when novel terrain can be effectively
+analyzed using a physics-based model. To overcome this challenge, we introduce
+Physics-Informed Evidential Traversability (PIETRA), a self-supervised learning
+framework that integrates physics priors directly into the mathematical
+formulation of evidential neural networks and introduces physics knowledge
+implicitly through an uncertainty-aware, physics-informed training loss. Our
+evidential network seamlessly transitions between learned and physics-based
+predictions for out-of-distribution inputs. Additionally, the physics-informed
+loss regularizes the learned model, ensuring better alignment with the physics
+model. Extensive simulations and hardware experiments demonstrate that PIETRA
+improves both learning accuracy and navigation performance in environments with
+significant distribution shifts.
+
+
+
+ comment: To appear in RA-L. Video: https://youtu.be/OTnNZ96oJRk
+
+ Accurately predicting the trajectory of vehicles is critically important for
+ensuring safety and reliability in autonomous driving. Although considerable
+research efforts have been made recently, the inherent trajectory uncertainty
+caused by various factors including the dynamic driving intents and the diverse
+driving scenarios still poses significant challenges to accurate trajectory
+prediction. To address this issue, we propose C2F-TP, a coarse-to-fine
+denoising framework for uncertainty-aware vehicle trajectory prediction. C2F-TP
+features an innovative two-stage coarse-to-fine prediction process.
+Specifically, in the spatial-temporal interaction stage, we propose a
+spatial-temporal interaction module to capture the inter-vehicle interactions
+and learn a multimodal trajectory distribution, from which a certain number of
+noisy trajectories are sampled. Next, in the trajectory refinement stage, we
+design a conditional denoising model to reduce the uncertainty of the sampled
+trajectories through a step-wise denoising operation. Extensive experiments are
+conducted on two real datasets NGSIM and highD that are widely adopted in
+trajectory prediction. The result demonstrates the effectiveness of our
+proposal.
+
+
+
+ comment: Accepted by AAAI 2025
+
+
+
+
+
+
+ ♻ ☆ Log-Time K-Means Clustering for 1D Data: Novel Approaches with Proof and
+ Implementation
+
+
+ Clustering is a key task in machine learning, with $k$-means being widely
+used for its simplicity and effectiveness. While 1D clustering is common,
+existing methods often fail to exploit the structure of 1D data, leading to
+inefficiencies. This thesis introduces optimized algorithms for $k$-means++
+initialization and Lloyd's algorithm, leveraging sorted data, prefix sums, and
+binary search for improved computational performance. The main contributions
+are: (1) an optimized $k$-cluster algorithm achieving $O(l \cdot k^2 \cdot \log
+n)$ complexity for greedy $k$-means++ initialization and $O(i \cdot k \cdot
+\log n)$ for Lloyd's algorithm, where $l$ is the number of greedy $k$-means++
+local trials, and $i$ is the number of Lloyd's algorithm iterations, and (2) a
+binary search-based two-cluster algorithm, achieving $O(\log n)$ runtime with
+deterministic convergence to a Lloyd's algorithm local minimum. Benchmarks
+demonstrate over a 4500x speedup compared to scikit-learn for large datasets
+while maintaining clustering quality measured by within-cluster sum of squares
+(WCSS). Additionally, the algorithms achieve a 300x speedup in an LLM
+quantization task, highlighting their utility in emerging applications. This
+thesis bridges theory and practice for 1D $k$-means clustering, delivering
+efficient and sound algorithms implemented in a JIT-optimized open-source
+Python library.
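To make the role of sorted data, prefix sums, and binary search concrete, here is a minimal Python sketch of one Lloyd iteration on sorted 1D data. It is an illustration written for this digest, not code from the thesis's library; the function names `build_prefix_sums`, `segment_cost`, and `lloyd_step` are hypothetical. Cluster boundaries are found by binary search on the midpoints between adjacent centroids, and new means come from prefix sums, so one iteration costs O(k log n) after the prefix sums are built.

```python
from bisect import bisect_right
from itertools import accumulate


def build_prefix_sums(sorted_xs):
    """Prefix sums of x and x^2 over sorted data, each with a leading 0."""
    s1 = [0.0] + list(accumulate(sorted_xs))
    s2 = [0.0] + list(accumulate(x * x for x in sorted_xs))
    return s1, s2


def segment_cost(s1, s2, lo, hi):
    """Within-cluster sum of squares of sorted_xs[lo:hi], computed in O(1)."""
    n = hi - lo
    if n == 0:
        return 0.0
    total = s1[hi] - s1[lo]
    return (s2[hi] - s2[lo]) - total * total / n


def lloyd_step(sorted_xs, s1, centroids):
    """One Lloyd iteration using binary search and prefix sums."""
    centroids = sorted(centroids)
    # Points at or below the midpoint of two adjacent centroids go left.
    cuts = [0]
    for a, b in zip(centroids, centroids[1:]):
        cuts.append(bisect_right(sorted_xs, (a + b) / 2.0))
    cuts.append(len(sorted_xs))
    new_centroids = []
    for c, lo, hi in zip(centroids, cuts, cuts[1:]):
        # Keep the old centroid if its cluster is empty.
        new_centroids.append((s1[hi] - s1[lo]) / (hi - lo) if hi > lo else c)
    return new_centroids, cuts
```

The total WCSS for a clustering is the sum of `segment_cost` over consecutive entries of `cuts`, which is how clustering quality can be tracked without revisiting individual points.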
+
+
+
+ comment: Undergraduate thesis, Department of Computer Science and Engineering,
+ Seoul National University. Minor revisions incorporated post-submission
+
+
+
+
+
+
+ ♻ ☆ The Effectiveness of Local Updates for Decentralized Learning under Data
+ Heterogeneity
+
+
+ We revisit two fundamental decentralized optimization methods, Decentralized
+Gradient Tracking (DGT) and Decentralized Gradient Descent (DGD), with multiple
+local updates. We consider two settings and demonstrate that incorporating
+local update steps can reduce communication complexity. Specifically, for
+$\mu$-strongly convex and $L$-smooth loss functions, we proved that local DGT
+achieves communication complexity $\tilde{\mathcal{O}}
+\Big(\frac{L}{\mu(K+1)} + \frac{\delta + \mu}{\mu (1 - \rho)} + \frac{\rho
+}{(1 - \rho)^2} \cdot \frac{L + \delta}{\mu}\Big)$, where $K$ is the number of
+additional local updates,
+$\rho$ measures the network connectivity and $\delta$ measures the second-order
+heterogeneity of the local losses. Our results reveal the tradeoff between
+communication and computation and show increasing $K$ can effectively reduce
+communication costs when the data heterogeneity is low and the network is
+well-connected. We then consider the over-parameterization regime where the
+local losses share the same minimums. We proved that employing local updates in
+DGD, even without gradient correction, achieves exact linear convergence under
+the Polyak-Łojasiewicz (PL) condition, which can yield a similar effect as
+DGT in reducing communication complexity. Customization of the result to
+linear models is further provided, with improved rate expression. Numerical
+experiments validate our theoretical results.
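The second setting above, where all local losses share a minimizer, can be simulated in a few lines. The sketch below is a toy model under our own assumptions: scalar quadratic losses f_i(x) = a_i/2 (x - target)^2, a fixed doubly stochastic mixing matrix, and a hypothetical function `local_dgd` whose `local_steps` counts all gradient steps taken between two communication rounds. It is not the algorithm statement or the experimental setup from the paper.

```python
def local_dgd(curvatures, target, mixing, x0,
              local_steps=3, rounds=200, lr=0.1):
    """Decentralized gradient descent with several local steps per round.

    Node i holds f_i(x) = curvatures[i] / 2 * (x - target)^2, so every
    local loss is minimized at the same point `target`.
    """
    x = list(x0)
    n = len(x)
    for _ in range(rounds):
        # Local computation: each node descends on its own loss only.
        for _ in range(local_steps):
            x = [xi - lr * a * (xi - target) for xi, a in zip(x, curvatures)]
        # Communication: one gossip averaging step with the mixing matrix.
        x = [sum(mixing[i][j] * x[j] for j in range(n)) for i in range(n)]
    return x
```

In this toy model each local step shrinks every node's distance to the shared minimizer and averaging cannot increase the largest distance, so the iterates reach the minimizer without any gradient correction, in line with the exact convergence claim for the over-parameterized regime.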
+
+
+ Accurate prediction helps to achieve supply-demand balance in energy systems,
+supporting decision-making and scheduling. Traditional models, lacking
+AI-assisted automation, rely on experts, incur high costs, and struggle with
+sparse data prediction. To address these challenges, we propose the Energy
+Forecasting Large Language Model (EF-LLM), which integrates domain knowledge
+and temporal data for time-series forecasting, supporting both pre-forecast
+operations and post-forecast decision-support. EF-LLM's human-AI interaction
+capabilities lower the entry barrier in forecasting tasks, reducing the need
+for extra expert involvement. To achieve this, we propose a continual learning
+approach with updatable LoRA and a multi-channel architecture for aligning
+heterogeneous multimodal data, enabling EF-LLM to continually learn
+heterogeneous multimodal knowledge. In addition, EF-LLM enables accurate
+predictions under sparse data conditions through its ability to process
+multimodal data. We propose Fusion Parameter-Efficient Fine-Tuning (F-PEFT)
+method to effectively leverage both time-series data and text for this purpose.
+EF-LLM is also the first energy-specific LLM to detect hallucinations and
+quantify their occurrence rate, achieved via multi-task learning, semantic
+similarity analysis, and ANOVA. We have achieved success in energy prediction
+scenarios for load, photovoltaic, and wind power forecast.
+
+
+ Although quantization for linear layers has been widely used, its application
+to accelerate the attention process remains limited. To further enhance the
+efficiency of attention computation compared to SageAttention while maintaining
+precision, we propose SageAttention2, which utilizes significantly faster 4-bit
+matrix multiplication (Matmul) alongside additional precision-enhancing
+techniques. First, we propose to quantize matrices $(Q, K)$ to INT4 in a
+hardware-friendly thread-level granularity and quantize matrices $(\widetilde
+P, V)$ to FP8. Second, we propose a method to smooth $Q$, enhancing the
+accuracy of INT4 $QK$. Third, we propose to use an FP32 Matmul buffer for $PV$
+to enhance the accuracy of FP8 $\widetilde PV$. The operations per second (OPS)
+of SageAttention2 surpass FlashAttention2 and xformers by about 3x and 5x on
+RTX4090, respectively. Comprehensive experiments confirm that our approach
+incurs negligible end-to-end metrics loss across diverse models, including
+those for large language processing, image generation, and video generation.
+The codes are available at https://github.com/thu-ml/SageAttention.
+
+
+
+
+
+
+
+ ♻ ☆ SageAttention: Accurate 8-Bit Attention for Plug-and-play Inference
+ Acceleration
+
+
+ The transformer architecture predominates across various models. As the heart
+of the transformer, attention has a computational complexity of O(N^2),
+compared to O(N) for linear transformations. When handling large sequence
+lengths, attention becomes the primary time-consuming component. Although
+quantization has proven to be an effective method for accelerating model
+inference, existing quantization methods primarily focus on optimizing the
+linear layer. In response, we first analyze in detail the feasibility of
+quantization in attention. Following that, we propose SageAttention, a highly
+efficient and accurate quantization method for attention. The OPS (operations
+per second) of our approach outperforms FlashAttention2 and xformers by about
+2.1 times and 2.7 times, respectively. SageAttention also achieves superior
+accuracy performance over FlashAttention3. Comprehensive experiments confirm
+that our approach incurs almost no end-to-end metrics loss across diverse
+models, including those for large language processing, image generation, and
+video generation. The codes are available at
+https://github.com/thu-ml/SageAttention.
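For readers unfamiliar with quantized attention, the following pure-Python sketch shows the basic mechanism of computing the score matrix $QK^\top$ from integer codes. It is a simplified illustration, not the SageAttention kernel: the per-row symmetric scaling, the helper names `quantize_row` and `quantized_scores`, and the use of Python lists instead of GPU tensor-core matmuls are all assumptions made here for clarity.

```python
def quantize_row(row, bits=8):
    """Symmetric per-row quantization: returns integer codes and the scale."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(v) for v in row)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    return [round(v / scale) for v in row], scale


def quantized_scores(Q, K, bits=8):
    """Approximate Q @ K^T with integer dot products and per-row scales."""
    q_codes = [quantize_row(r, bits) for r in Q]
    k_codes = [quantize_row(r, bits) for r in K]
    return [
        # Integer accumulation first, then dequantize with both scales.
        [sq * sk * sum(a * b for a, b in zip(qc, kc)) for kc, sk in k_codes]
        for qc, sq in q_codes
    ]
```

Lowering `bits` to 4 in this sketch makes the approximation error visibly larger, which gives some intuition for why lower-precision variants pair quantization with additional precision-enhancing steps.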
+
+
+
+
+
+
+
+ ♻ ☆ A Pioneering Neural Network Method for Efficient and Robust Fuel
+ Sloshing Simulation in Aircraft AAAI
+
+
+ Simulating fuel sloshing within aircraft tanks during flight is crucial for
+aircraft safety research. Traditional methods based on Navier-Stokes equations
+are computationally expensive. In this paper, we treat fluid motion as point
+cloud transformation and propose the first neural network method specifically
+designed for simulating fuel sloshing in aircraft. This model is also the first
+deep learning model capable of stably modeling fluid
+particle dynamics in such complex scenarios. Our triangle feature fusion design
+achieves an optimal balance among fluid dynamics modeling, momentum
+conservation constraints, and global stability control. Additionally, we
+constructed the Fueltank dataset, the first dataset for aircraft fuel surface
+sloshing. It comprises 320,000 frames across four typical tank types and covers
+a wide range of flight maneuvers, including multi-directional rotations. We
+conducted comprehensive experiments on both our dataset and the take-off
+scenario of the aircraft. Compared to existing neural network-based fluid
+simulation algorithms, we significantly enhanced accuracy while maintaining
+high computational speed. Compared to traditional SPH methods, our speed
+improved approximately 10 times. Furthermore, compared to traditional fluid
+simulation software such as Flow3D, our computation speed increased by more
+than 300 times.
+
+
+
+ comment: This paper has been accepted by AAAI Conference on Artificial
+ Intelligence (AAAI-25)
+
+
+
+
+
+
+ ♻ ☆ Learning Mutual Excitation for Hand-to-Hand and Human-to-Human
+ Interaction Recognition
+
+
+
+
+
+
+
+
+ Mengyuan Liu, Chen Chen, Songtao Wu, Fanyang Meng, Hong Liu
+
+
+ Recognizing interactive actions, including hand-to-hand interaction and
+human-to-human interaction, has attracted increasing attention for various
+applications in the field of video analysis and human-robot interaction.
+Considering the success of graph convolution in modeling topology-aware
+features from skeleton data, recent methods commonly operate graph convolution
+on separate entities and use late fusion for interactive action recognition,
+which can barely model the mutual semantic relationships between pairwise
+entities. To this end, we propose a mutual excitation graph convolutional
+network (me-GCN) by stacking mutual excitation graph convolution (me-GC)
+layers. Specifically, me-GC uses a mutual topology excitation module to firstly
+extract adjacency matrices from individual entities and then adaptively model
+the mutual constraints between them. Moreover, me-GC extends the above idea and
+further uses a mutual feature excitation module to extract and merge deep
+features from pairwise entities. Compared with graph convolution, our proposed
+me-GC gradually learns mutual information in each layer and each stage of graph
+convolution operations. Extensive experiments on a challenging hand-to-hand
+interaction dataset, i.e., the Assembly101 dataset, and two large-scale
+human-to-human interaction datasets, i.e., NTU60-Interaction and
+NTU120-Interaction consistently verify the superiority of our proposed method,
+which outperforms the state-of-the-art GCN-based and Transformer-based methods.
+
+
+
+
+
+
+
+ ♻ ☆ Algorithm Design for Continual Learning in IoT Networks
+
+
+ Continual learning (CL) is a new online learning technique over sequentially
+generated streaming data from different tasks, aiming to maintain a small
+forgetting loss on previously-learned tasks. Existing work focuses on reducing
+the forgetting loss under a given task sequence. However, if similar tasks
+continuously appear to the end time, the forgetting loss is still huge on prior
+distinct tasks. In practical IoT networks, an autonomous vehicle that samples data
+and learns different tasks can choose its route and alter the order of tasks at
+an increased travelling cost. To the best of our knowledge, we are the first to study how
+to opportunistically route the testing object and alter the task sequence in
+CL. We formulate a new optimization problem and prove it NP-hard. We propose a
+polynomial-time algorithm to achieve approximation ratios of $\frac{3}{2}$ for
+underparameterized case and $\frac{3}{2} + r^{1-T}$ for overparameterized case,
+respectively, where $r:=1-\frac{n}{m}$ is a parameter of feature number $m$ and
+sample number $n$ and $T$ is the task number. Simulation results verify our
+algorithm's close-to-optimum performance.
+
+
+
+
+
+
+
+
+
+
+ Multimedia 6
+
+
+
+
+
+ ☆ DiTCtrl: Exploring Attention Control in Multi-Modal Diffusion
+ Transformer for Tuning-Free Multi-Prompt Longer Video Generation
+
+
+ Sora-like video generation models have achieved remarkable progress with a
+Multi-Modal Diffusion Transformer (MM-DiT) architecture. However, the current
+video generation models predominantly focus on single-prompt, struggling to
+generate coherent scenes with multiple sequential prompts that better reflect
+real-world dynamic scenarios. While some pioneering works have explored
+multi-prompt video generation, they face significant challenges including
+strict training data requirements, weak prompt following, and unnatural
+transitions. To address these problems, we propose DiTCtrl, a training-free
+multi-prompt video generation method under MM-DiT architectures for the first
+time. Our key idea is to take the multi-prompt video generation task as
+temporal video editing with smooth transitions. To achieve this goal, we first
+analyze MM-DiT's attention mechanism, finding that the 3D full attention
+behaves similarly to that of the cross/self-attention blocks in the UNet-like
+diffusion models, enabling mask-guided precise semantic control across
+different prompts with attention sharing for multi-prompt video generation.
+Based on our careful design, the video generated by DiTCtrl achieves smooth
+transitions and consistent object motion given multiple sequential prompts
+without additional training. Besides, we also present MPVBench, a new benchmark
+specially designed for multi-prompt video generation to evaluate the
+performance of multi-prompt generation. Extensive experiments demonstrate that
+our method achieves state-of-the-art performance without additional training.
+
+
+ Current conversational recommendation systems focus predominantly on text.
+However, real-world recommendation settings are generally multimodal, causing a
+significant gap between existing research and practical applications. To
+address this issue, we propose Muse, the first multimodal conversational
+recommendation dataset. Muse comprises 83,148 utterances from 7,000
+conversations centered around the Clothing domain. Each conversation contains
+comprehensive multimodal interactions, rich elements, and natural dialogues.
+Data in Muse are automatically synthesized by a multi-agent framework powered
+by multimodal large language models (MLLMs). It innovatively derives user
+profiles from real-world scenarios rather than depending on manual design and
+history data for better scalability, and then it fulfills conversation
+simulation and optimization. Both human and LLM evaluations demonstrate the
+high quality of conversations in Muse. Additionally, fine-tuning experiments on
+three MLLMs demonstrate Muse's learnable patterns for recommendations and
+responses, confirming its value for multimodal conversational recommendation.
+Our dataset and codes are available at
+https://anonymous.4open.science/r/Muse-0086.
+
+
+ Diffusion Probabilistic Models (DPMs) have emerged as the de facto approach
+for high-fidelity image synthesis, operating diffusion processes on continuous
+VAE latents, an approach that differs significantly from the text generation methods
+employed by Large Language Models (LLMs). In this paper, we introduce a novel
+generative framework, the Recurrent Diffusion Probabilistic Model (RDPM), which
+enhances the diffusion process through a recurrent token prediction mechanism,
+thereby pioneering the field of Discrete Diffusion. By progressively
+introducing Gaussian noise into the latent representations of images and
+encoding them into vector-quantized tokens in a recurrent manner, RDPM
+facilitates a unique diffusion process on discrete-value domains. This process
+iteratively predicts the token codes for subsequent timesteps, transforming the
+initial standard Gaussian noise into the source data distribution, aligning
+with GPT-style models in terms of the loss function. RDPM demonstrates superior
+performance while benefiting from the speed advantage of requiring only a few
+inference steps. This model not only leverages the diffusion process to ensure
+high-quality generation but also converts continuous signals into a series of
+high-fidelity discrete tokens, thereby maintaining a unified optimization
+strategy with other discrete tokens, such as text. We anticipate that this work
+will contribute to the development of a unified model for multimodal
+generation, specifically by integrating continuous signal domains such as
+images, videos, and audio with text. We will release the code and model weights
+to the open-source community.
+
+
+
+ comment: 8 pages
+
+
+
+
+
+
+ ♻ ☆ The Practice of Averaging Rate-Distortion Curves over Testsets to
+ Compare Learned Video Codecs Can Cause Misleading Conclusions
+
+
+
+
+
+
+
+
+ M. Akin Yilmaz, Onur Keleş, A. Murat Tekalp
+
+
+ This paper aims to demonstrate how the prevalent practice in the learned
+video compression community of averaging rate-distortion (RD) curves across a
+test video set can lead to misleading conclusions in evaluating codec
+performance. Through an analytical treatment of a simple case and experimental
+results with two recent learned video codecs, we show how averaged RD curves
+can mislead comparative evaluation of different codecs, particularly when
+videos in a dataset have varying characteristics and operating ranges. We
+illustrate how a single video with distinct RD characteristics from the rest of
+the test set can disproportionately influence the average RD curve, potentially
+overshadowing a codec's superior performance across most individual sequences.
+Using two recent learned video codecs on the UVG dataset as a case study, we
+demonstrate that computing performance metrics, such as the BD rate, from the
+average RD curve suggests conclusions that contradict those reached from
+calculating the average of per-sequence metrics. Hence, we argue that the
+learned video compression community should also report per-sequence RD curves
+and performance metrics for a test set should be computed from the average of
+per-sequence metrics, similar to the established practice in traditional video
+coding, to ensure fair and accurate codec comparisons.
+
+
+
+ comment: Submitted to IEEE Signal Processing Letters
+
+ In this paper, we introduce the Diff-Instruct* (DI*), an image data-free
+approach for building one-step text-to-image generative models that align with
+human preference while maintaining the ability to generate highly realistic
+images. We frame human preference alignment as online reinforcement learning
+using human feedback (RLHF), where the goal is to maximize the reward function
+while regularizing the generator distribution to remain close to a reference
+diffusion process. Unlike traditional RLHF approaches, which rely on the KL
+divergence for regularization, we introduce a novel score-based divergence
+regularization, which leads to significantly better performance. Although the
+direct calculation of this preference alignment objective remains intractable,
+we demonstrate that we can efficiently compute its gradient by deriving an
+equivalent yet tractable loss function. Remarkably, we used Diff-Instruct* to
+train a Stable Diffusion-XL-based 1-step model, the 2.6B DI*-SDXL-1step
+text-to-image model, which can generate images of a resolution of 1024x1024
+with only 1 generation step. The DI*-SDXL-1step model uses only 1.88% of the
+inference time and 29.30% of the GPU memory of the 12B FLUX-dev-50step model
+while significantly outperforming it in PickScore, ImageReward, and CLIPScore
+on the Parti prompt benchmark and in HPSv2.1 on the Human Preference Score
+benchmark, establishing a new state of the art among human-preferred 1-step
+text-to-image generative models. Besides
+the strong quantitative performance, extensive qualitative comparisons also
+confirm the advantages of DI* in terms of maintaining diversity, improving
+image layouts, and enhancing aesthetic colors. We have released our
+industry-ready model on the homepage:
+https://github.com/pkulwj1994/diff_instruct_star.
+
+
+
+ comment: revision: 2.6B 1-step text-to-image model outperforms 12B
+ Flux-dev-50step model in human preferences
+
+
+
+
+
+
+ ♻ ☆ L3TC: Leveraging RWKV for Learned Lossless Low-Complexity Text
+ Compression
+
+
+
+
+
+
+
+
+ Junxuan Zhang, Zhengxue Cheng, Yan Zhao, Shihao Wang, Dajiang Zhou, Guo Lu, Li Song
+
+
+ Learning-based probabilistic models can be combined with an entropy coder for
+data compression. However, due to the high complexity of learning-based models,
+their practical application as text compressors has been largely overlooked. To
+address this issue, our work focuses on a low-complexity design while
+maintaining compression performance. We introduce a novel Learned Lossless
+Low-complexity Text Compression method (L3TC). First, we conduct
+extensive experiments demonstrating that RWKV models achieve the fastest
+decoding speed with a moderate compression ratio, making RWKV the most suitable
+backbone for our method. Second, we propose an outlier-aware tokenizer that
+uses a limited vocabulary to cover frequent tokens while allowing outliers to
+bypass the prediction and encoding. Third, we propose a novel high-rank
+reparameterization strategy that enhances the learning capability during
+training without increasing complexity during inference. Experimental results
+validate that our method achieves a 48% bit saving compared to the gzip compressor.
+Besides, L3TC offers compression performance comparable to other learned
+compressors, with a 50x reduction in model parameters. More importantly, L3TC
+is the fastest among all learned compressors, providing real-time decoding
+speeds up to megabytes per second. Our code is available at
+https://github.com/alipay/L3TC-leveraging-rwkv-for-learned-lossless-low-complexity-text-compression.git.
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Computation and Language 64
+
+
+
+
+
+ ☆ Factuality or Fiction? Benchmarking Modern LLMs on Ambiguous QA with
+ Citations
+
+
+ Benchmarking modern large language models (LLMs) on complex and realistic
+tasks is critical to advancing their development. In this work, we evaluate the
+factual accuracy and citation performance of state-of-the-art LLMs on the task
+of Question Answering (QA) in ambiguous settings with source citations. Using
+three recently published datasets (DisentQA-DupliCite, DisentQA-ParaCite, and
+AmbigQA-Cite) featuring a range of real-world ambiguities, we analyze the
+performance of two leading LLMs, GPT-4o-mini and Claude-3.5. Our results show
+that larger, recent models consistently predict at least one correct answer in
+ambiguous contexts but fail to handle cases with multiple valid answers.
+Additionally, all models perform equally poorly in citation generation, with
+citation accuracy consistently at 0. However, introducing conflict-aware
+prompting leads to large improvements, enabling models to better address
+multiple valid answers and improve citation accuracy, while maintaining their
+ability to predict correct answers. These findings highlight the challenges and
+opportunities in developing LLMs that can handle ambiguity and provide reliable
+source citations. Our benchmarking study provides critical insights and sets a
+foundation for future improvements in trustworthy and interpretable QA systems.
+
+
+
+
+
+
+
+ ☆ Emoji Retrieval from Gibberish or Garbled Social Media Text: A Novel
+ Methodology and A Case Study
+
+
+ Emojis are widely used across social media platforms but are often lost in
+noisy or garbled text, posing challenges for data analysis and machine
+learning. Conventional preprocessing approaches recommend removing such text,
+risking the loss of emojis and their contextual meaning. This paper proposes a
+three-step reverse-engineering methodology to retrieve emojis from garbled text
+in social media posts. The methodology also identifies reasons for the
+generation of such text during social media data mining. To evaluate its
+effectiveness, the approach was applied to 509,248 Tweets about the Mpox
+outbreak, a dataset referenced in about 30 prior works that failed to retrieve
+emojis from garbled text. Our method retrieved 157,748 emojis from 76,914
+Tweets. Improvements in text readability and coherence were demonstrated
+through metrics such as Flesch Reading Ease, Flesch-Kincaid Grade Level,
+Coleman-Liau Index, Automated Readability Index, Dale-Chall Readability Score,
+Text Standard, and Reading Time. Additionally, the frequency of individual
+emojis and their patterns of usage in these Tweets were analyzed, and the
+results are presented.
+
+
+
+
+
+
+
+ ☆ Aligning AI Research with the Needs of Clinical Coding Workflows: Eight
+ Recommendations Based on US Data Analysis and Critical Review
+
+
+
+
+
+
+
+
+ Yidong Gan, Maciej Rybinski, Ben Hachey, Jonathan K. Kummerfeld
+
+
+ Clinical coding is crucial for healthcare billing and data analysis. Manual
+clinical coding is labour-intensive and error-prone, which has motivated
+research towards full automation of the process. However, our analysis, based
+on US English electronic health records and automated coding research using
+these records, shows that widely used evaluation methods are not aligned with
+real clinical contexts. For example, evaluations that focus on the top 50 most
+common codes are an oversimplification, as there are thousands of codes used in
+practice. This position paper aims to align AI coding research more closely
+with practical challenges of clinical coding. Based on our analysis, we offer
+eight specific recommendations, suggesting ways to improve current evaluation
+methods. Additionally, we propose new AI-based methods beyond automated coding,
+suggesting alternative approaches to assist clinical coders in their workflows.
+
+
+
+ comment: We received a meta-review score of 5 in ARR October 2024
+
+
+
+
+
+
+ ☆ Theoretical Constraints on the Expressive Power of $\mathsf{RoPE}$-based
+ Tensor Attention Transformers
+
+
+ Tensor Attention extends traditional attention mechanisms by capturing
+high-order correlations across multiple modalities, addressing the limitations
+of classical matrix-based attention. Meanwhile, Rotary Position Embedding
+($\mathsf{RoPE}$) has shown superior performance in encoding positional
+information in long-context scenarios, significantly enhancing transformer
+models' expressiveness. Despite these empirical successes, the theoretical
+limitations of these technologies remain underexplored. In this study, we
+analyze the circuit complexity of Tensor Attention and $\mathsf{RoPE}$-based
+Tensor Attention, showing that with polynomial precision, constant-depth
+layers, and linear or sublinear hidden dimension, they cannot solve fixed
+membership problems or $(A_{F,r})^*$ closure problems, under the assumption
+that $\mathsf{TC}^0 \neq \mathsf{NC}^1$. These findings highlight a gap between
+the empirical performance and theoretical constraints of Tensor Attention and
+$\mathsf{RoPE}$-based Tensor Attention Transformers, offering insights that
+could guide the development of more theoretically grounded approaches to
+Transformer model design and scaling.
+
+
+
+
+
+
+
+ ☆ Explainability in Neural Networks for Natural Language Processing Tasks
+
+
+ Neural networks are widely regarded as black-box models, creating significant
+challenges in understanding their inner workings, especially in natural
+language processing (NLP) applications. To address this opacity, model
+explanation techniques like Local Interpretable Model-Agnostic Explanations
+(LIME) have emerged as essential tools for providing insights into the behavior
+of these complex systems. This study leverages LIME to interpret a multi-layer
+perceptron (MLP) neural network trained on a text classification task. By
+analyzing the contribution of individual features to model predictions, the
+LIME approach enhances interpretability and supports informed decision-making.
+Despite its effectiveness in offering localized explanations, LIME has
+limitations in capturing global patterns and feature interactions. This
+research highlights the strengths and shortcomings of LIME and proposes
+directions for future work to achieve more comprehensive interpretability in
+neural NLP models.
+
+
+
+
+
+
+
+ ☆ Same Company, Same Signal: The Role of Identity in Earnings Call
+ Transcripts
+
+
+ Post-earnings volatility prediction is critical for investors, with previous
+works often leveraging earnings call transcripts under the assumption that
+their rich semantics contribute significantly. To further investigate how
+transcripts impact volatility, we introduce DEC, a dataset featuring accurate
+volatility calculations enabled by the previously overlooked beforeAfterMarket
+attribute and dense ticker coverage. Unlike established benchmarks, where each
+ticker has only around two earnings, DEC provides 20 earnings records per
+ticker. Using DEC, we reveal that post-earnings volatility undergoes
+significant shifts, with each ticker displaying a distinct volatility
+distribution. To leverage historical post-earnings volatility and capture
+ticker-specific patterns, we propose two training-free baselines: Post-earnings
+Volatility (PEV) and Same-ticker Post-earnings Volatility (STPEV). These
+baselines surpass all transcripts-based models on DEC as well as on established
+benchmarks. Additionally, we demonstrate that current transcript
+representations predominantly capture ticker identity rather than offering
+financially meaningful insights specific to each earnings. This is evidenced by
+two key observations: earnings representations from the same ticker exhibit
+significantly higher similarity compared to those from different tickers, and
+predictions from transcript-based models show strong correlations with prior
+post-earnings volatility.
+
+
+ The rapid development of large language models (LLMs) necessitates robust,
+unbiased, and scalable methods for evaluating their capabilities. However,
+human annotations are expensive to scale, model-based evaluations are prone to
+biases in answer style, while target-answer-based benchmarks are vulnerable to
+data contamination and cheating. To address these limitations, we propose
+StructTest, a novel benchmark that evaluates LLMs on their ability to produce
+compositionally specified structured outputs as an unbiased, cheap-to-run and
+difficult-to-cheat measure. The evaluation is done deterministically by a
+rule-based evaluator, which can be easily extended to new tasks. By testing
+structured outputs across diverse task domains -- including Summarization,
+Code, HTML and Math -- we demonstrate that StructTest serves as a good proxy
+for general reasoning abilities, as producing structured outputs often requires
+internal logical reasoning. We believe that StructTest offers a critical,
+complementary approach to objective and robust model evaluation.
+
+
+
+
+
+
+
+ ☆ Correctness is not Faithfulness in RAG Attributions
+
+
+
+
+
+
+
+
+ Jonas Wallat, Maria Heuss, Maarten de Rijke, Avishek Anand
+
+
+ Retrieving relevant context is a common approach to reduce hallucinations and
+enhance answer reliability. Explicitly citing source documents allows users to
+verify generated responses and increases trust. Prior work largely evaluates
+citation correctness - whether cited documents support the corresponding
+statements. But citation correctness alone is insufficient. To establish trust
+in attributed answers, we must examine both citation correctness and citation
+faithfulness. In this work, we first disentangle the notions of citation
+correctness and faithfulness, which have been applied inconsistently in
+previous studies. Faithfulness ensures that the model's reliance on cited
+documents is genuine, reflecting actual reference use rather than superficial
+alignment with prior beliefs, which we call post-rationalization. We design an
+experiment that reveals the prevalent issue of post-rationalization, which
+undermines reliable attribution and may result in misplaced trust. Our findings
+suggest that current attributed answers often lack citation faithfulness (up to
+57 percent of the citations), highlighting the need to evaluate correctness and
+faithfulness for trustworthy attribution in language models.
+
+
+
+ comment: 13 pages, 3 figures
+
+
+
+
+
+
+ ☆ CARL-GT: Evaluating Causal Reasoning Capabilities of Large Language
+ Models
+
+
+ Causal reasoning capabilities are essential for large language models (LLMs)
+in a wide range of applications, such as education and healthcare. But there is
+still a lack of benchmarks for a better understanding of such capabilities.
+Current LLM benchmarks are mainly based on conversational tasks, academic math
+tests, and coding tests. Such benchmarks evaluate LLMs in well-regularized
+settings, but they are limited in assessing the skills and abilities to solve
+real-world problems. In this work, we provide a benchmark, named CARL-GT,
+which evaluates CAusal Reasoning capabilities of large Language models using
+Graphs and Tabular data. The benchmark has a diverse range of tasks for
+evaluating LLMs from causal graph reasoning, knowledge discovery, and
+decision-making aspects. In addition, effective zero-shot learning prompts are
+developed for the tasks. In our experiments, we leverage the benchmark for
+evaluating open-source LLMs and provide a detailed comparison of LLMs for
+causal reasoning abilities. We found that LLMs are still weak in causal
+reasoning, especially with tabular data to discover new insights. Furthermore,
+we investigate and discuss the relationships of different benchmark tasks by
+analyzing the performance of LLMs. The experimental results show that LLMs have
+different strengths over different tasks and that their performance on tasks in
+different categories, i.e., causal graph reasoning, knowledge discovery, and
+decision-making, shows stronger correlation than tasks in the same category.
+
+
+
+
+
+
+
+ ☆ Path-of-Thoughts: Extracting and Following Paths for Robust Relational
+ Reasoning with Large Language Models
+
+
+
+
+
+
+
+
+ Ge Zhang, Mohammad Ali Alomrani, Hongjian Gu, Jiaming Zhou, Yaochen Hu, Bin Wang, Qun Liu, Mark Coates, Yingxue Zhang, Jianye Hao
+
+
+ Large language models (LLMs) possess vast semantic knowledge but often
+struggle with complex reasoning tasks, particularly in relational reasoning
+problems such as kinship or spatial reasoning. In this paper, we present
+Path-of-Thoughts (PoT), a novel framework designed to tackle relational reasoning
+by decomposing the task into three key stages: graph extraction, path
+identification, and reasoning. Unlike previous approaches, PoT efficiently
+extracts a task-agnostic graph that identifies crucial entities, relations, and
+attributes within the problem context. Subsequently, PoT identifies relevant
+reasoning chains within the graph corresponding to the posed question,
+facilitating inference of potential answers. Experimental evaluations on four
+benchmark datasets, demanding long reasoning chains, demonstrate that PoT
+surpasses state-of-the-art baselines by a significant margin (maximum 21.3%)
+without necessitating fine-tuning or extensive LLM calls. Furthermore, as
+opposed to prior neuro-symbolic methods, PoT exhibits improved resilience
+against LLM errors by leveraging the compositional nature of graphs.
+
+
+
+
+
+
+
+ ☆ IITR-CIOL@NLU of Devanagari Script Languages 2025: Multilingual Hate
+ Speech Detection and Target Identification in Devanagari-Scripted Languages
+
+
+ This work focuses on two subtasks related to hate speech detection and target
+identification in Devanagari-scripted languages, specifically Hindi, Marathi,
+Nepali, Bhojpuri, and Sanskrit. Subtask B involves detecting hate speech in
+online text, while Subtask C requires identifying the specific targets of hate
+speech, such as individuals, organizations, or communities. We propose the
+MultilingualRobertaClass model, a deep neural network built on the pretrained
+multilingual transformer model ia-multilingual-transliterated-roberta,
+optimized for classification tasks in multilingual and transliterated contexts.
+The model leverages contextualized embeddings to handle linguistic diversity,
+with a classifier head for binary classification. We achieved 88.40% accuracy
+in Subtask B and 66.11% accuracy in Subtask C on the test set.
+
+
+
+
+
+
+
+ ☆ BenCzechMark : A Czech-centric Multitask and Multimetric Benchmark for
+ Large Language Models with Duel Scoring Mechanism
+
+
+
+
+
+
+
+
+ Martin Fajcik, Martin Docekal, Jan Dolezal, Karel Ondrej, Karel Beneš, Jan Kapsa, Pavel Smrz, Alexander Polok, Michal Hradis, Zuzana Neverilova, Ales Horak, Radoslav Sabol, Michal Stefanik, Adam Jirkovsky, David Adamczyk, Petr Hyner, Jan Hula, Hynek Kydlicek
+
+
+ We present BenCzechMark (BCM), the first comprehensive Czech language
+benchmark designed for large language models, offering diverse tasks, multiple
+task formats, and multiple evaluation metrics. Its scoring system is grounded
+in statistical significance theory and uses aggregation across tasks inspired
+by social preference theory. Our benchmark encompasses 50 challenging tasks,
+with corresponding test datasets, primarily in native Czech, with 11 newly
+collected ones. These tasks span 8 categories and cover diverse domains,
+including historical Czech news, essays from pupils or language learners, and
+spoken word.
+ Furthermore, we collect and clean BUT-Large Czech Collection, the largest
+publicly available clean Czech language corpus, and use it for (i)
+contamination analysis and (ii) continuous pretraining of the first Czech-centric
+7B language model, with Czech-specific tokenization. We use our model as a
+baseline for comparison with publicly available multilingual models. Lastly, we
+release and maintain a leaderboard, with 44 existing model submissions, where
+new model submissions can be made at
+https://huggingface.co/spaces/CZLC/BenCzechMark.
+
+
+
+
+
+
+
+
+ Filippos Bellos, Nam H. Nguyen, Jason J. Corso
+
+
+ Although LLMs have demonstrated remarkable capabilities in processing and
+generating textual data, their pre-trained vocabularies are ill-suited for
+capturing the nuanced temporal dynamics and patterns inherent in time series.
+The discrete, symbolic nature of natural language tokens, which these
+vocabularies are designed to represent, does not align well with the
+continuous, numerical nature of time series data. To address this fundamental
+limitation, we propose VITRO. Our method adapts textual inversion optimization
+from the vision-language domain in order to learn a new time series per-dataset
+vocabulary that bridges the gap between the discrete, semantic nature of
+natural language and the continuous, numerical nature of time series data. We
+show that learnable time series-specific pseudo-word embeddings represent time
+series data better than existing general language model vocabularies, with
+VITRO-enhanced methods achieving state-of-the-art performance in long-term
+forecasting across most datasets.
+
+
+
+ comment: Accepted to ICASSP 2025
+
+
+
+
+
+
+ ☆ A Multimodal Emotion Recognition System: Integrating Facial Expressions,
+ Body Movement, Speech, and Spoken Language
+
+
+ Traditional psychological evaluations rely heavily on human observation and
+interpretation, which are prone to subjectivity, bias, fatigue, and
+inconsistency. To address these limitations, this work presents a multimodal
+emotion recognition system that provides a standardised, objective, and
+data-driven tool to support evaluators, such as psychologists, psychiatrists,
+and clinicians. The system integrates recognition of facial expressions,
+speech, spoken language, and body movement analysis to capture subtle emotional
+cues that are often overlooked in human evaluations. By combining these
+modalities, the system provides more robust and comprehensive emotional state
+assessment, reducing the risk of mis- and overdiagnosis. Preliminary testing in
+a simulated real-world condition demonstrates the system's potential to provide
+reliable emotional insights to improve diagnostic accuracy. This work
+highlights the promise of automated multimodal analysis as a valuable
+complement to traditional psychological evaluation practices, with applications
+in clinical and therapeutic settings.
+
+
+
+ comment: 10 pages, 6 figures, 3 tables
+
+
+
+
+
+
+ ☆ Cross-Lingual Text-Rich Visual Comprehension: An Information Theory
+ Perspective
+
+
+ Recent Large Vision-Language Models (LVLMs) have shown promising reasoning
+capabilities on text-rich images from charts, tables, and documents. However,
+the abundant text within such images may increase the model's sensitivity to
+language. This raises the need to evaluate LVLM performance on cross-lingual
+text-rich visual inputs, where the language in the image differs from the
+language of the instructions. To address this, we introduce XT-VQA
+(Cross-Lingual Text-Rich Visual Question Answering), a benchmark designed to
+assess how LVLMs handle language inconsistency between image text and
+questions. XT-VQA integrates five existing text-rich VQA datasets and a newly
+collected dataset, XPaperQA, covering diverse scenarios that require faithful
+recognition and comprehension of visual information despite language
+inconsistency. Our evaluation of prominent LVLMs on XT-VQA reveals a
+significant drop in performance for cross-lingual scenarios, even for models
+with multilingual capabilities. A mutual information analysis suggests that
+this performance gap stems from cross-lingual questions failing to adequately
+activate relevant visual information. To mitigate this issue, we propose
+MVCL-MI (Maximization of Vision-Language Cross-Lingual Mutual Information),
+where a visual-text cross-lingual alignment is built by maximizing mutual
+information between the model's outputs and visual information. This is
+achieved by distilling knowledge from monolingual to cross-lingual settings
+through KL divergence minimization, where monolingual output logits serve as a
+teacher. Experimental results on XT-VQA demonstrate that MVCL-MI
+effectively reduces the visual-text cross-lingual performance disparity while
+preserving the inherent capabilities of LVLMs, shedding new light on the
+potential practice for improving LVLMs. Code is available at:
+https://github.com/Stardust-y/XTVQA.git
+
+
+
+
+
+
+
+ ☆ ResearchTown: Simulator of Human Research Community
+
+
+ Large Language Models (LLMs) have demonstrated remarkable potential in
+scientific domains, yet a fundamental question remains unanswered: Can we
+simulate human research communities with LLMs? Addressing this question can
+deepen our understanding of the processes behind idea brainstorming and inspire
+the automatic discovery of novel scientific insights. In this work, we propose
+ResearchTown, a multi-agent framework for research community simulation. Within
+this framework, the human research community is simplified and modeled as an
+agent-data graph, where researchers and papers are represented as agent-type
+and data-type nodes, respectively, and connected based on their collaboration
+relationships. We also introduce TextGNN, a text-based inference framework that
+models various research activities (e.g., paper reading, paper writing, and
+review writing) as special forms of a unified message-passing process on the
+agent-data graph. To evaluate the quality of the research simulation, we
+present ResearchBench, a benchmark that uses a node-masking prediction task for
+scalable and objective assessment based on similarity. Our experiments reveal
+three key findings: (1) ResearchTown can provide a realistic simulation of
+collaborative research activities, including paper writing and review writing;
+(2) ResearchTown can maintain robust simulation with multiple researchers and
+diverse papers; (3) ResearchTown can generate interdisciplinary research ideas
+that potentially inspire novel research directions.
+
+
+
+
+
+
+
+ ☆ In Case You Missed It: ARC 'Challenge' Is Not That Challenging
+
+
+ ARC Challenge appears more difficult than ARC Easy for modern LLMs primarily
+due to an evaluation setup that prevents direct comparison of answer choices
+rather than inherent complexity. Although some researchers have quietly shifted
+to a more appropriate scheme over the last year, the implications of this
+change have yet to be widely acknowledged. We highlight this overlooked shift,
+show how similar evaluation practices falsely imply reasoning deficits in other
+benchmarks, and demonstrate that fairer methods dramatically reduce performance
+gaps (e.g. on SIQA) and even yield superhuman results (OpenBookQA). In doing
+so, we reveal how evaluation shapes perceived difficulty and offer guidelines
+to ensure that multiple-choice evaluations accurately reflect actual model
+capabilities.
+
+
+
+
+
+
+
+ ☆ Deliberation in Latent Space via Differentiable Cache Augmentation
+
+
+
+
+
+
+
+
+ Luyang Liu, Jonas Pfeiffer, Jiaxing Wu, Jun Xie, Arthur Szlam
+
+
+ Techniques enabling large language models (LLMs) to "think more" by
+generating and attending to intermediate reasoning steps have shown promise in
+solving complex problems. However, the standard approaches generate sequences
+of discrete tokens immediately before responding, and so they can incur
+significant latency costs and be challenging to optimize. In this work, we
+demonstrate that a frozen LLM can be augmented with an offline coprocessor that
+operates on the model's key-value (kv) cache. This coprocessor augments the
+cache with a set of latent embeddings designed to improve the fidelity of
+subsequent decoding. We train this coprocessor using the language modeling loss
+from the decoder on standard pretraining data, while keeping the decoder itself
+frozen. This approach enables the model to learn, in an end-to-end
+differentiable fashion, how to distill additional computation into its
+kv-cache. Because the decoder remains unchanged, the coprocessor can operate
+offline and asynchronously, and the language model can function normally if the
+coprocessor is unavailable or if a given cache is deemed not to require extra
+computation. We show experimentally that when a cache is augmented, the decoder
+achieves lower perplexity on numerous subsequent tokens. Furthermore, even
+without any task-specific training, our experiments demonstrate that cache
+augmentation consistently reduces perplexity and improves performance across a
+range of reasoning-intensive tasks.
+
+
+
+
+
+
+
+ ☆ RepoTransBench: A Real-World Benchmark for Repository-Level Code
+ Translation
+
+
+ Repository-level code translation refers to translating an entire code
+repository from one programming language to another while preserving the
+functionality of the source repository. Many benchmarks have been proposed to
+evaluate the performance of such code translators. However, previous benchmarks
+mostly provide fine-grained samples, focusing on snippet-, function-,
+or file-level code translation. Such benchmarks do not accurately reflect
+real-world demands, where entire repositories often need to be translated,
+involving longer code length and more complex functionalities. To address this
+gap, we propose a new benchmark, named RepoTransBench, which is a real-world
+repository-level code translation benchmark with an automatically executable
+test suite. We conduct experiments on RepoTransBench to evaluate the
+translation performance of 11 advanced LLMs. We find that the Success@1 score
+(test success in one attempt) of the best-performing LLM is only 7.33%. To
+further explore the potential of LLMs for repository-level code translation, we
+provide LLMs with error-related feedback to perform iterative debugging and
+observe an average 7.09% improvement on Success@1. However, even with this
+improvement, the Success@1 score of the best-performing LLM is only 21%, which
+may not meet the need for reliable automatic repository-level code translation.
+Finally, we conduct a detailed error analysis and highlight current LLMs'
+deficiencies in repository-level code translation, which could provide a
+reference for further improvements.
+
+
+
+
+
+
+
+ ☆ Fourier Position Embedding: Enhancing Attention's Periodic Extension for
+ Length Generalization
+
+
+
+
+
+
+
+
+ Ermo Hua, Che Jiang, Xingtai Lv, Kaiyan Zhang, Ning Ding, Youbang Sun, Biqing Qi, Yuchen Fan, Xue Kai Zhu, Bowen Zhou
+
+
+ Extending the context length of Language Models (LMs) by improving Rotary
+Position Embedding (RoPE) has become a trend. While existing works mainly
+address RoPE's limitations within the attention mechanism, this paper provides an
+analysis across nearly all parts of LMs, uncovering their adverse effects on
+length generalization for RoPE-based attention. Using Discrete Signal
+Processing theory, we show that RoPE enables periodic attention by implicitly
+achieving a Non-Uniform Discrete Fourier Transform. However, this periodicity is
+undermined by the spectral damage caused by: 1) linear layers and activation
+functions outside of attention; 2) insufficiently trained frequency components
+brought by time-domain truncation. Building on our observations, we propose
+Fourier Position Embedding (FoPE), which enhances attention's frequency-domain
+properties to improve both its periodic extension and length generalization.
+FoPE constructs a Fourier series and zeros out the destructive frequency
+components, increasing model robustness against the spectral damage.
+Experiments across various model scales show that, within varying context
+windows, FoPE can maintain a more stable perplexity and a more consistent
+accuracy in a needle-in-haystack task compared to RoPE and ALiBi. Several
+analyses and ablations bring further support to our method and theoretical
+modeling.
+
+
+
+ comment: 14 pages, 7 figures
+
+
+
+
+
+
+ ☆ Chumor 2.0: Towards Benchmarking Chinese Humor Understanding
+
+
+
+
+
+
+
+
+ Ruiqi He, Yushu He, Longju Bai, Jiarui Liu, Zhenjie Sun, Zenghao Tang, He Wang, Hanchen Xia, Rada Mihalcea, Naihao Deng
+
+
+ Existing humor datasets and evaluations predominantly focus on English,
+leaving limited resources for culturally nuanced humor in non-English languages
+like Chinese. To address this gap, we construct Chumor, the first Chinese humor
+explanation dataset that exceeds the size of existing humor datasets. Chumor is
+sourced from Ruo Zhi Ba, a Chinese Reddit-like platform known for sharing
+intellectually challenging and culturally specific jokes. We test ten LLMs
+through direct and chain-of-thought prompting, revealing that Chumor poses
+significant challenges to existing LLMs, with their accuracy slightly above
+random and far below human performance. In addition, our analysis highlights that
+human-annotated humor explanations are significantly better than those
+generated by GPT-4o and ERNIE-4-turbo. We release Chumor at
+https://huggingface.co/datasets/dnaihao/Chumor, our project page is at
+https://dnaihao.github.io/Chumor-dataset/, our leaderboard is at
+https://huggingface.co/spaces/dnaihao/Chumor, and our codebase is at
+https://github.com/dnaihao/Chumor-dataset.
+
+
+
+ comment: arXiv admin note: substantial text overlap with arXiv:2406.12754
+
+
+
+
+
+
+
+ Changyue Wang, Weihang Su, Qingyao Ai, Yiqun Liu
+
+
+ Large Language Models (LLMs) have demonstrated exceptional capabilities
+across a wide range of natural language processing (NLP) tasks. However,
+keeping these models up-to-date with evolving world knowledge remains a
+significant challenge due to the high costs of frequent retraining. To address
+this challenge, knowledge editing techniques have emerged to update LLMs with
+new information without rebuilding the model from scratch. Among these, the
+in-context editing paradigm stands out for its effectiveness in integrating new
+knowledge while preserving the model's original capabilities. Despite its
+potential, existing in-context knowledge editing methods are often
+task-specific, focusing primarily on multi-hop QA tasks using structured
+knowledge triples. Moreover, their reliance on few-shot prompting for task
+decomposition makes them unstable and less effective in generalizing across
+diverse tasks.
+ In response to these limitations, we propose EditCoT, a novel knowledge
+editing framework that flexibly and efficiently updates LLMs across various
+tasks without retraining. EditCoT works by generating a chain-of-thought (CoT)
+for a given input and then iteratively refining this CoT process using a CoT
+editor based on updated knowledge. We evaluate EditCoT across a diverse range
+of benchmarks, covering multiple languages and tasks. The results demonstrate
+that our approach achieves state-of-the-art performance while offering superior
+generalization, effectiveness, and stability compared to existing methods,
+marking a significant advancement in the field of knowledge updating. Code and
+data are available at: https://github.com/bebr2/EditCoT.
+
+
+
+
+
+
+
+ ☆ Understanding the Logic of Direct Preference Alignment through Logic
+
+
+ Recent direct preference alignment algorithms (DPA), such as DPO, have shown
+great promise in aligning large language models to human preferences. While
+this has motivated the development of many new variants of the original DPO
+loss, understanding the differences between these recent proposals, as well as
+developing new DPA loss functions, remains difficult given the lack of a
+technical and conceptual framework for reasoning about the underlying semantics
+of these algorithms. In this paper, we attempt to remedy this by formalizing
+DPA losses in terms of discrete reasoning problems. Specifically, we ask: Given
+an existing DPA loss, can we systematically derive a symbolic expression that
+characterizes its semantics? How do the semantics of two losses relate to each
+other? We propose a novel formalism for characterizing preference losses for
+single model and reference model based approaches, and identify symbolic forms
+for a number of commonly used DPA variants. Further, we show how this formal
+view of preference learning sheds new light on both the size and structure of
+the DPA loss landscape, making it possible to not only rigorously characterize
+the relationships between recent loss proposals but also to systematically
+explore the landscape and derive new loss functions from first principles. We
+hope our framework and findings will help provide useful guidance to those
+working on human AI alignment.
+
+
+
+
+
+
+
+ ☆ Large Language Model Safety: A Holistic Survey
+
+
+ The rapid development and deployment of large language models (LLMs) have
+introduced a new frontier in artificial intelligence, marked by unprecedented
+capabilities in natural language understanding and generation. However, the
+increasing integration of these models into critical applications raises
+substantial safety concerns, necessitating a thorough examination of their
+potential risks and associated mitigation strategies.
+ This survey provides a comprehensive overview of the current landscape of LLM
+safety, covering four major categories: value misalignment, robustness to
+adversarial attacks, misuse, and autonomous AI risks. In addition to the
+comprehensive review of the mitigation methodologies and evaluation resources
+on these four aspects, we further explore four topics related to LLM safety:
+the safety implications of LLM agents, the role of interpretability in
+enhancing LLM safety, the technology roadmaps proposed and followed by a number of
+AI companies and institutes for LLM safety, and AI governance aimed at LLM
+safety with discussions on international cooperation, policy proposals, and
+prospective regulatory directions.
+ Our findings underscore the necessity for a proactive, multifaceted approach
+to LLM safety, emphasizing the integration of technical solutions, ethical
+considerations, and robust governance frameworks. This survey is intended to
+serve as a foundational resource for academic researchers, industry
+practitioners, and policymakers, offering insights into the challenges and
+opportunities associated with the safe integration of LLMs into society.
+Ultimately, it seeks to contribute to the safe and beneficial development of
+LLMs, aligning with the overarching goal of harnessing AI for societal
+advancement and well-being. A curated list of related papers is publicly
+available at https://github.com/tjunlp-lab/Awesome-LLM-Safety-Papers.
+
+
+
+ comment: 158 pages, 18 figures
+
+
+
+
+
+
+ ☆ Generating Completions for Fragmented Broca's Aphasic Sentences Using
+ Large Language Models
+
+
+
+
+
+
+
+
+ Sijbren van Vaals, Yevgen Matusevych, Frank Tsiwah
+
+
+ Broca's aphasia is a type of aphasia characterized by non-fluent, effortful
+and fragmented speech production with relatively good comprehension. Since
+traditional aphasia treatment methods are often time-consuming,
+labour-intensive, and do not reflect real-world conversations, applying natural
+language processing based approaches such as Large Language Models (LLMs) could
+potentially contribute to improving existing treatment approaches. To address
+this issue, we explore the use of sequence-to-sequence LLMs for completing
+fragmented Broca's aphasic sentences. We first generate synthetic Broca's
+aphasic data using a rule-based system designed to mirror the linguistic
+characteristics of Broca's aphasic speech. Using this synthetic data, we then
+fine-tune four pre-trained LLMs on the task of completing fragmented sentences.
+We evaluate our fine-tuned models on both synthetic and authentic Broca's
+aphasic data. We demonstrate LLMs' capability for reconstructing fragmented
+sentences, with the models showing improved performance with longer input
+utterances. Our results highlight LLMs' potential in advancing
+communication aids for individuals with Broca's aphasia and possibly other
+clinical populations.
+
+
+
+
+
+
+
+ ☆ The Power of Adaptation: Boosting In-Context Learning through Adaptive
+ Prompting
+
+
+ Large Language Models (LLMs) have demonstrated exceptional abilities across a
+broad range of language-related tasks, including generating solutions to
+complex reasoning problems. An effective technique to enhance LLM performance
+is in-context learning, which encourages a step-by-step reasoning process by
+including explanatory examples to guide the model's responses. However,
+selecting appropriate exemplars for the model poses a challenge, as each
+dataset demands a distinct set of exemplars to enable the LLM to learn
+effectively and perform well on the test set. Current studies often rely on
+uncertainty- or diversity-based selection strategies to select exemplars for
+annotation and to improve model learning. However, these studies typically
+employ a non-adaptive approach, selecting a set of exemplars all at once. We
+argue that this non-adaptive strategy may result in a set of exemplars with
+high redundancy in terms of the knowledge covered, ultimately reducing their
+overall informativeness. To address this limitation, we propose
+Adaptive-Prompt, a novel method that adaptively selects exemplars by
+leveraging model feedback from previously chosen exemplars. Experimental
+results show that Adaptive-Prompt significantly enhances LLM
+performance across a variety of reasoning tasks.
+
+
+
+
+
+
+
+ ☆ Tracking the Feature Dynamics in LLM Training: A Mechanistic Study
+
+
+ Understanding training dynamics and feature evolution is crucial for the
+mechanistic interpretability of large language models (LLMs). Although sparse
+autoencoders (SAEs) have been used to identify features within LLMs, a clear
+picture of how these features evolve during training remains elusive. In this
+study, we: (1) introduce SAE-Track, a method to efficiently obtain a continual
+series of SAEs; (2) formulate the process of feature formation and conduct a
+mechanistic analysis; and (3) analyze and visualize feature drift during
+training. Our work provides new insights into the dynamics of features in LLMs,
+enhancing our understanding of training mechanisms and feature evolution.
+
+
+
+
+
+
+
+ ☆ LiveIdeaBench: Evaluating LLMs' Scientific Creativity and Idea
+ Generation with Minimal Context
+
+
+
+
+
+
+
+
+ Kai Ruan, Xuan Wang, Jixiang Hong, Hao Sun
+
+
+ While Large Language Models (LLMs) have demonstrated remarkable capabilities
+in scientific tasks, existing evaluation frameworks primarily assess their
+performance using rich contextual inputs, overlooking their ability to generate
+novel ideas from minimal information. We introduce LiveIdeaBench, a
+comprehensive benchmark that evaluates LLMs' scientific creativity and
+divergent thinking capabilities using single-keyword prompts. Drawing from
+Guilford's creativity theory, our framework employs a dynamic panel of
+state-of-the-art LLMs to assess generated ideas across four key dimensions:
+originality, feasibility, fluency, and flexibility. Through extensive
+experimentation with 20 leading models across 1,180 keywords spanning 18
+scientific domains, we reveal that scientific creative ability shows distinct
+patterns from general intelligence metrics. Notably, our results demonstrate
+that models like QwQ-32B-preview achieve comparable creative performance to
+top-tier models like o1-preview, despite significant gaps in their general
+intelligence scores. These findings highlight the importance of specialized
+evaluation frameworks for scientific creativity and suggest that the
+development of creative capabilities in LLMs may follow different trajectories
+than traditional problem-solving abilities.
+
+
+ Transformer architectures are increasingly effective at processing and
+generating very long chunks of texts, opening new perspectives for
+document-level machine translation (MT). In this work, we challenge the ability
+of MT systems to handle texts comprising up to several thousands of tokens. We
+design and implement a new approach to precisely measure the effect of
+length increments on MT outputs. Our experiments with two representative
+architectures unambiguously show that (a) translation performance decreases
+with the length of the input text; (b) the position of sentences within the
+document matters and translation quality is higher for sentences occurring
+earlier in a document. We further show that manipulating the distribution of
+document lengths and of positional embeddings only marginally mitigates such
+problems. Our results suggest that even though document-level MT is
+computationally feasible, it does not yet match the performance of
+sentence-based MT.
+
+
+
+ comment: Under review
+
+
+
+
+
+
+ ☆ ERUPD -- English to Roman Urdu Parallel Dataset
+
+
+
+
+
+
+
+
+ Mohammed Furqan, Raahid Bin Khaja, Rayyan Habeeb
+
+
+ Bridging linguistic gaps fosters global growth and cultural exchange. This
+study addresses the challenges of Roman Urdu -- a Latin-script adaptation of
+Urdu widely used in digital communication -- by creating a novel parallel
+dataset comprising 75,146 sentence pairs. Roman Urdu's lack of standardization,
+phonetic variability, and code-switching with English complicates language
+processing. We tackled this by employing a hybrid approach that combines
+synthetic data generated via advanced prompt engineering with real-world
+conversational data from personal messaging groups. We further refined the
+dataset through a human evaluation phase, addressing linguistic inconsistencies
+and ensuring accuracy in code-switching, phonetic representations, and synonym
+variability. The resulting dataset captures Roman Urdu's diverse linguistic
+features and serves as a critical resource for machine translation, sentiment
+analysis, and multilingual education.
+
+
+
+ comment: 9 pages, 1 figure
+
+
+
+
+
+
+ ☆ A Survey of Query Optimization in Large Language Models
+
+
+ Query Optimization (QO) refers to techniques aimed at enhancing the
+efficiency and quality of Large Language Models (LLMs) in understanding and
+answering queries, especially complex ones in scenarios like
+Retrieval-Augmented Generation (RAG). Specifically, RAG mitigates the
+limitations of LLMs by dynamically retrieving and leveraging up-to-date
+relevant information, which provides a cost-effective solution to the challenge
+of LLMs producing plausible but potentially inaccurate responses. Recently, as
+RAG evolves and incorporates multiple components that influence its
+performance, QO has emerged as a critical element, playing a pivotal role in
+determining the effectiveness of RAG's retrieval stage in accurately sourcing
+the necessary multiple pieces of evidence to answer queries correctly. In this
+paper, we trace the evolution of QO techniques by summarizing and analyzing
+significant studies. Through an organized framework and categorization, we aim
+to consolidate existing QO techniques in RAG, elucidate their technological
+foundations, and highlight their potential to enhance the versatility and
+applications of LLMs.
+
+
+
+ comment: Ongoing Work
+
+
+
+
+
+
+ ☆ Comparative Analysis of Document-Level Embedding Methods for Similarity
+ Scoring on Shakespeare Sonnets and Taylor Swift Lyrics
+
+
+ This study evaluates the performance of TF-IDF weighting, averaged Word2Vec
+embeddings, and BERT embeddings for document similarity scoring across two
+contrasting textual domains. By analysing cosine similarity scores, the
+methods' strengths and limitations are highlighted. The findings underscore
+TF-IDF's reliance on lexical overlap and Word2Vec's superior semantic
+generalisation, particularly in cross-domain comparisons. BERT demonstrates
+lower performance in challenging domains, likely due to insufficient
+domain-specific fine-tuning.
+
+
+
+ comment: 9 pages, 4 figures
+
+
+
+
+
+
+ ☆ Resource-Aware Arabic LLM Creation: Model Adaptation, Integration, and
+ Multi-Domain Testing
+
+
+ This paper presents a novel approach to fine-tuning the Qwen2-1.5B model for
+Arabic language processing using Quantized Low-Rank Adaptation (QLoRA) on a
+system with only 4GB VRAM. We detail the process of adapting this large
+language model to the Arabic domain, using diverse datasets including Bactrian,
+OpenAssistant, and Wikipedia Arabic corpora. Our methodology involves custom
+data preprocessing, model configuration, and training optimization techniques
+such as gradient accumulation and mixed-precision training. We address specific
+challenges in Arabic NLP, including morphological complexity, dialectal
+variations, and diacritical mark handling. Experimental results over 10,000
+training steps show significant performance improvements, with the final loss
+converging to 0.1083. We provide comprehensive analysis of GPU memory usage,
+training dynamics, and model evaluation across various Arabic language tasks,
+including text classification, question answering, and dialect identification.
+The fine-tuned model demonstrates robustness to input perturbations and
+improved handling of Arabic-specific linguistic phenomena. This research
+contributes to multilingual AI by demonstrating a resource-efficient approach
+for creating specialized language models, potentially democratizing access to
+advanced NLP technologies for diverse linguistic communities. Our work paves
+the way for future research in low-resource language adaptation and efficient
+fine-tuning of large language models.
+
+
+
+
+
+
+
+ ☆ Domain adapted machine translation: What does catastrophic forgetting
+ forget and why? EMNLP 2024
+
+
+ Neural Machine Translation (NMT) models can be specialized by domain
+adaptation, often involving fine-tuning on a dataset of interest. This process
+risks catastrophic forgetting: rapid loss of generic translation quality.
+Forgetting has been widely observed, with many mitigation methods proposed.
+However, the causes of forgetting and the relationship between forgetting and
+adaptation data are under-explored.
+ This paper takes a novel approach to understanding catastrophic forgetting
+during NMT adaptation by investigating the impact of the data. We provide a
+first investigation of what is forgotten, and why. We examine the relationship
+between forgetting and the in-domain data, and show that the amount and type of
+forgetting is linked to that data's target vocabulary coverage. Our findings
+pave the way toward better informed NMT domain adaptation.
+
+
+
+ comment: EMNLP 2024
+
+
+
+
+
+
+ ☆ CiteBART: Learning to Generate Citations for Local Citation
+ Recommendation
+
+
+ Citations are essential building blocks in scientific writing. The scientific
+community is longing for support in their generation. Citation generation
+involves two complementary subtasks: determining the citation worthiness of a
+context and, if it is citation-worthy, proposing the best candidate papers for the
+citation placeholder. The latter subtask is called local citation
+recommendation (LCR). This paper proposes CiteBART, a custom BART pre-training
+based on citation token masking to generate citations to achieve LCR. In the
+base scheme, we mask the citation token in the local citation context to make
+the citation prediction. In the global one, we concatenate the citing paper's
+title and abstract to the local citation context to learn to reconstruct the
+citation token. CiteBART outperforms state-of-the-art approaches on the
+citation recommendation benchmarks except for the smallest FullTextPeerRead
+dataset. The effect is significant in the larger benchmarks, e.g., Refseer and
+ArXiv. We present a qualitative analysis and an ablation study to provide
+insights into the workings of CiteBART. Our analyses confirm that its
+generative nature brings about a zero-shot capability.
+
+
+
+ comment: 15 pages, 2 figures, 7 tables
+
+
+
+
+
+
+ ☆ Behind Closed Words: Creating and Investigating the forePLay Annotated
+ Dataset for Polish Erotic Discourse
+
+
+
+
+
+
+
+
+ Anna Kołos, Katarzyna Lorenc, Emilia Wiśnios, Agnieszka Karlińska
+
+
+ The surge in online content has created an urgent demand for robust detection
+systems, especially in non-English contexts where current tools demonstrate
+significant limitations. We present forePLay, a novel Polish language dataset
+for erotic content detection, featuring over 24k annotated sentences with a
+multidimensional taxonomy encompassing ambiguity, violence, and social
+unacceptability dimensions. Our comprehensive evaluation demonstrates that
+specialized Polish language models achieve superior performance compared to
+multilingual alternatives, with transformer-based architectures showing
+particular strength in handling imbalanced categories. The dataset and
+accompanying analysis establish essential frameworks for developing
+linguistically-aware content moderation systems, while highlighting critical
+considerations for extending such capabilities to morphologically complex
+languages.
+
+
+
+ comment: The forePLay dataset and associated resources will be made publicly
+ available for research purposes upon publication, in accordance with data
+ sharing regulations
+
+ Large Language Models (LLMs) are susceptible to generating harmful content
+when prompted with carefully crafted inputs, a vulnerability known as LLM
+jailbreaking. As LLMs become more powerful, studying jailbreak methods is
+critical to enhancing security and aligning models with human values.
+Traditionally, jailbreak techniques have relied on suffix addition or prompt
+templates, but these methods suffer from limited attack diversity. This paper
+introduces DiffusionAttacker, an end-to-end generative approach for jailbreak
+rewriting inspired by diffusion models. Our method employs a
+sequence-to-sequence (seq2seq) text diffusion model as a generator,
+conditioning on the original prompt and guiding the denoising process with a
+novel attack loss. Unlike previous approaches that use autoregressive LLMs to
+generate jailbreak prompts, which limit the modification of already generated
+tokens and restrict the rewriting space, DiffusionAttacker utilizes a seq2seq
+diffusion model, allowing more flexible token modifications. This approach
+preserves the semantic content of the original prompt while producing harmful
+content. Additionally, we leverage the Gumbel-Softmax technique to make the
+sampling process from the diffusion model's output distribution differentiable,
+eliminating the need for iterative token search. Extensive experiments on
+Advbench and Harmbench demonstrate that DiffusionAttacker outperforms previous
+methods across various evaluation metrics, including attack success rate (ASR),
+fluency, and diversity.
+
+
+
+
+
+
+
+ ☆ DRT-o1: Optimized Deep Reasoning Translation via Long Chain-of-Thought
+
+
+ Recently, O1-like models have emerged as representative examples,
+illustrating the effectiveness of long chain-of-thought (CoT) in reasoning
+tasks such as math and coding tasks. In this paper, we introduce DRT-o1, an
+attempt to bring the success of long CoT to neural machine translation (MT).
+Specifically, literary books often involve similes and metaphors, and
+translating such texts into a target language is very difficult in
+practice due to cultural differences. In such cases, literal translation often
+fails to convey the intended meaning effectively. Even for professional human
+translators, considerable thought must be given to preserving semantics
+throughout the translation process. To simulate LLMs' long thought ability in
+MT, we first mine sentences containing similes or metaphors from existing
+literature books, and then develop a multi-agent framework to translate these
+sentences via long thought. In the multi-agent framework, a translator is used
+to iteratively translate the source sentence under the suggestions provided by
+an advisor. To ensure the effectiveness of the long thoughts, an evaluator is
+also employed to judge whether the translation in the current round is better
+than the previous one or not. In this manner, we collect tens of thousands of
+long-thought MT samples, which are used to train DRT-o1. The experimental
+results on literature translation demonstrate the effectiveness of DRT-o1.
+Using Qwen2.5-7B and Qwen2.5-14B as the backbones, DRT-o1 brings improvements of
+7.33 to 8.26 BLEU and 1.66 to 3.36 CometScore. In addition, DRT-o1-7B can
+outperform QwQ-32B-Preview by 7.82 BLEU and 1.46 CometScore, showing its
+effectiveness. The project is available at https://github.com/krystalan/DRT-o1
+
+
+
+
+
+
+
+ ☆ A Silver Bullet or a Compromise for Full Attention? A Comprehensive
+ Study of Gist Token-based Context Compression
+
+
+ In this work, we provide a thorough investigation of gist-based context
+compression methods to improve long-context processing in large language
+models. We focus on two key questions: (1) How well can these methods replace
+full attention models? and (2) What potential failure patterns arise due to
+compression? Through extensive experiments, we show that while gist-based
+compression can achieve near-lossless performance on tasks like
+retrieval-augmented generation and long-document QA, it faces challenges in
+tasks like synthetic recall. Furthermore, we identify three key failure
+patterns: lost by the boundary, lost if surprise, and lost along the way. To
+mitigate these issues, we propose two effective strategies: fine-grained
+autoencoding, which enhances the reconstruction of original token information,
+and segment-wise token importance estimation, which adjusts optimization based
+on token dependencies. Our work provides valuable insights into the
+understanding of gist token-based context compression and offers practical
+strategies for improving compression capabilities.
+
+
+
+
+
+
+
+ ☆ A Survey on Multi-Generative Agent System: Recent Advances and New
+ Frontiers
+
+
+ Multi-generative agent systems (MGASs) have become a research hotspot since
+the rise of large language models (LLMs). However, with the continuous influx
+of new related works, the existing reviews struggle to capture them
+comprehensively. This paper presents a comprehensive survey of these studies.
+We first discuss the definition of MGAS, a framework encompassing much of
+previous work. We provide an overview of the various applications of MGAS in
+(i) solving complex tasks, (ii) simulating specific scenarios, and (iii)
+evaluating generative agents. Building on previous studies, we also highlight
+several challenges and propose future directions for research in this field.
+
+
+
+ comment: 13 pages, 1 figure
+
+
+
+
+
+
+ ♻ ☆ Memorization Over Reasoning? Exposing and Mitigating Verbatim
+ Memorization in Large Language Models' Character Understanding Evaluation
+
+
+ Recently, Large Language Models (LLMs) have shown impressive performance in
+character understanding tasks, such as analyzing the roles, personalities, and
+relationships of fictional characters. However, the extensive pre-training
+corpora used by LLMs raise concerns that they may rely on memorizing popular
+fictional works rather than genuinely understanding and reasoning about them.
+In this work, we argue that 'gist memory' (capturing essential meaning) should
+be the primary mechanism for character understanding tasks, as opposed to
+'verbatim memory' (exact match of a string). We introduce a simple yet
+effective method to mitigate mechanized memorization in character understanding
+evaluations while preserving the essential implicit cues needed for
+comprehension and reasoning. Our approach reduces memorization-driven
+performance on popular fictional works from 96% accuracy to 72% and results in
+up to an 18% drop in accuracy across various character understanding tasks.
+These findings underscore the issue of data contamination in existing
+benchmarks, which often measure memorization rather than true character
+understanding.
+
+
+
+
+
+
+
+ ♻ ☆ Knowledge Graphs are all you need: Leveraging KGs in Physics Question
+ Answering
+
+
+
+
+
+
+
+
+ Krishnasai Addala, Kabir Dev Paul Baghel, Dhruv Jain, Chhavi Kirtani, Avinash Anand, Rajiv Ratn Shah
+
+
+ This study explores the effectiveness of using knowledge graphs generated by
+large language models to decompose high school-level physics questions into
+sub-questions. We introduce a pipeline aimed at enhancing model response
+quality for Question Answering tasks. By employing LLMs to construct knowledge
+graphs that capture the internal logic of the questions, these graphs then
+guide the generation of subquestions. We hypothesize that this method yields
+sub-questions that are more logically consistent with the original questions
+compared to traditional decomposition techniques. Our results show that
+sub-questions derived from knowledge graphs exhibit significantly improved
+fidelity to the original question's logic. This approach not only enhances the
+learning experience by providing clearer and more contextually appropriate
+sub-questions but also highlights the potential of LLMs to transform
+educational methodologies. The findings indicate a promising direction for
+applying AI to improve the quality and effectiveness of educational content.
+
+
+ Evaluating Large Language Models (LLMs) as general-purpose agents is
+essential for understanding their capabilities and facilitating their
+integration into practical applications. However, the evaluation process
+presents substantial challenges. A primary obstacle is the benchmarking of
+agent performance across diverse scenarios within a unified framework,
+especially in maintaining partially-observable environments and ensuring
+multi-round interactions. Moreover, current evaluation frameworks mostly focus
+on the final success rate, revealing few insights during the process and
+failing to provide a deep understanding of the model abilities. To address
+these challenges, we introduce AgentBoard, a pioneering comprehensive benchmark
+and accompanying open-source evaluation framework tailored to analytical
+evaluation of LLM agents. AgentBoard offers a fine-grained progress rate metric
+that captures incremental advancements as well as a comprehensive evaluation
+toolkit that features easy assessment of agents for multi-faceted analysis.
+This not only sheds light on the capabilities and limitations of LLM agents but
+also propels the interpretability of their performance to the forefront.
+Ultimately, AgentBoard serves as a step towards demystifying agent behaviors
+and accelerating the development of stronger LLM agents.
+
+
+
+ comment: NeurIPS 2024 (Oral)
+
+
+
+
+
+
+ ♻ ☆ Steps are all you need: Rethinking STEM Education with Prompt
+ Engineering
+
+
+
+
+
+
+
+
+ Krishnasai Addala, Kabir Dev Paul Baghel, Chhavi Kirtani, Avinash Anand, Rajiv Ratn Shah
+
+
+ Few-shot and Chain-of-Thought prompting have shown promise when applied to
+Physics Question Answering Tasks, but are limited by the lack of mathematical
+ability inherent to LLMs, and are prone to hallucination. By utilizing a
+Mixture of Experts (MoE) Model, along with analogical prompting, we are able to
+show improved model performance when compared to the baseline on standard LLMs.
+We also survey the limits of these prompting techniques and the effects they
+have on model performance. Additionally, we propose Analogical CoT prompting, a
+prompting technique designed to allow smaller, open source models to leverage
+Analogical prompting, something they have struggled with, possibly due to a
+lack of specialist training data.
+
+
+
+
+
+
+
+ ♻ ☆ Navigating the Cultural Kaleidoscope: A Hitchhiker's Guide to
+ Sensitivity in Large Language Models
+
+
+ As LLMs are increasingly deployed in global applications, the importance of
+cultural sensitivity becomes paramount, ensuring that users from diverse
+backgrounds feel respected and understood. Cultural harm can arise when these
+models fail to align with specific cultural norms, resulting in
+misrepresentations or violations of cultural values. This work addresses the
+challenges of ensuring cultural sensitivity in LLMs, especially in
+small-parameter models that often lack the extensive training data needed to
+capture global cultural nuances. We present two key contributions: (1) A
+cultural harm test dataset, created to assess model outputs across different
+cultural contexts through scenarios that expose potential cultural
+insensitivities, and (2) A culturally aligned preference dataset, aimed at
+restoring cultural sensitivity through fine-tuning based on feedback from
+diverse annotators. These datasets facilitate the evaluation and enhancement of
+LLMs, ensuring their ethical and safe deployment across different cultural
+landscapes. Our results show that integrating culturally aligned feedback leads
+to a marked improvement in model behavior, significantly reducing the
+likelihood of generating culturally insensitive or harmful content. Ultimately,
+this work paves the way for more inclusive and respectful AI systems, fostering
+a future where LLMs can safely and ethically navigate the complexities of
+diverse cultural landscapes.
+
+
+
+
+
+
+
+
+ Sara Price, Arjun Panickssery, Sam Bowman, Asa Cooper Stickland
+
+
+ Backdoors are hidden behaviors that are only triggered once an AI system has
+been deployed. Bad actors looking to create successful backdoors must design
+them to avoid activation during training and evaluation. Since data used in
+these stages often only contains information about events that have already
+occurred, a component of a simple backdoor trigger could be a model recognizing
+data that is in the future relative to when it was trained. Through prompting
+experiments and by probing internal activations, we show that current large
+language models (LLMs) can distinguish past from future events, with probes on
+model activations achieving 90% accuracy. We train models with backdoors
+triggered by a temporal distributional shift; they activate when the model is
+exposed to news headlines beyond their training cut-off dates. Fine-tuning on
+helpful, harmless and honest (HHH) data does not work well for removing simpler
+backdoor triggers but is effective on our backdoored models, although this
+distinction is smaller for the larger-scale model we tested. We also find that
+an activation-steering vector representing a model's internal representation of
+the date influences the rate of backdoor activation. We take these results as
+initial evidence that, at least for models at the modest scale we test,
+standard safety measures are enough to remove these backdoors.
+
+
+
+
+
+
+
+ ♻ ☆ LLM for Barcodes: Generating Diverse Synthetic Data for Identity
+ Documents
+
+
+ Accurate barcode detection and decoding in Identity documents is crucial for
+applications like security, healthcare, and education, where reliable data
+extraction and verification are essential. However, building robust detection
+models is challenging due to the lack of diverse, realistic datasets, an issue
+often tied to privacy concerns and the wide variety of document formats.
+Traditional tools like Faker rely on predefined templates, making them less
+effective for capturing the complexity of real-world identity documents. In
+this paper, we introduce a new approach to synthetic data generation that uses
+LLMs to create contextually rich and realistic data without relying on
+predefined fields. Using the vast knowledge LLMs have about different documents
+and content, our method creates data that reflects the variety found in real
+identity documents. This data is then encoded into barcodes and overlaid on
+templates for documents such as driver's licenses, insurance cards, and student
+IDs. Our approach simplifies the process of dataset creation, eliminating the
+need for extensive domain knowledge or predefined fields. Compared to
+traditional methods like Faker, data generated by LLM demonstrates greater
+diversity and contextual relevance, leading to improved performance in barcode
+detection models. This scalable, privacy-first solution is a big step forward
+in advancing machine learning for automated document processing and identity
+verification.
+
+
+
+ comment: 5 pages, 1 figure
+
+
+
+
+
+
+ ♻ ☆ The Prompt Report: A Systematic Survey of Prompting Techniques
+
+
+
+
+
+
+
+
+ Sander Schulhoff, Michael Ilie, Nishant Balepur, Konstantine Kahadze, Amanda Liu, Chenglei Si, Yinheng Li, Aayush Gupta, HyoJung Han, Sevien Schulhoff, Pranav Sandeep Dulepet, Saurav Vidyadhara, Dayeon Ki, Sweta Agrawal, Chau Pham, Gerson Kroiz, Feileen Li, Hudson Tao, Ashay Srivastava, Hevander Da Costa, Saloni Gupta, Megan L. Rogers, Inna Goncearenco, Giuseppe Sarli, Igor Galynker, Denis Peskoff, Marine Carpuat, Jules White, Shyamal Anadkat, Alexander Hoyle, Philip Resnik
+
+
+ Generative Artificial Intelligence (GenAI) systems are increasingly being
+deployed across diverse industries and research domains. Developers and
+end-users interact with these systems through the use of prompting and prompt
+engineering. Although prompt engineering is a widely adopted and extensively
+researched area, it suffers from conflicting terminology and a fragmented
+ontological understanding of what constitutes an effective prompt due to its
+relatively recent emergence. We establish a structured understanding of prompt
+engineering by assembling a taxonomy of prompting techniques and analyzing
+their applications. We present a detailed vocabulary of 33 terms, a
+taxonomy of 58 LLM prompting techniques, and 40 techniques for other
+modalities. Additionally, we provide best practices and guidelines for prompt
+engineering, including advice for prompting state-of-the-art (SOTA) LLMs such
+as ChatGPT. We further present a meta-analysis of the entire literature on
+natural language prefix-prompting. As a culmination of these efforts, this
+paper presents the most comprehensive survey on prompt engineering to date.
+
+
+
+
+
+
+
+ ♻ ☆ Hands-On Tutorial: Labeling with LLM and Human-in-the-Loop COLING 2025
+
+
+
+
+
+
+
+
+ Ekaterina Artemova, Akim Tsvigun, Dominik Schlechtweg, Natalia Fedorova, Sergei Tilga, Konstantin Chernyshev, Boris Obmoroshev
+
+
+ Training and deploying machine learning models relies on a large amount of
+human-annotated data. As human labeling becomes increasingly expensive and
+time-consuming, recent research has developed multiple strategies to speed up
+annotation and reduce costs and human workload: generating synthetic training
+data, active learning, and hybrid labeling. This tutorial is oriented toward
+practical applications: we will present the basics of each strategy, highlight
+their benefits and limitations, and discuss in detail real-life case studies.
+Additionally, we will walk through best practices for managing human annotators
+and controlling the quality of the final dataset. The tutorial includes a
+hands-on workshop, where attendees will be guided in implementing a hybrid
+annotation setup. This tutorial is designed for NLP practitioners from both
+research and industry backgrounds who are involved in or interested in
+optimizing data labeling projects.
+
+
+
+ comment: To be presented at COLING 2025
+
+
+
+
+
+
+ ♻ ☆ Quantifying Positional Biases in Text Embedding Models NeurIPS
+
+
+ Embedding models are crucial for tasks in Information Retrieval (IR) and
+semantic similarity measurement, yet their handling of longer texts and
+associated positional biases remains underexplored. In this study, we
+investigate the impact of content position and input size on text embeddings.
+Our experiments reveal that embedding models, irrespective of their positional
+encoding mechanisms, disproportionately prioritize the beginning of an input.
+Ablation studies demonstrate that insertion of irrelevant text or removal at
+the start of a document reduces cosine similarity between altered and original
+embeddings by up to 12.3% more than ablations at the end. Regression analysis
+further confirms this bias, with sentence importance declining as position
+moves further from the start, even with content-agnosticity. We
+hypothesize that this effect arises from pre-processing strategies and chosen
+positional encoding techniques. These findings quantify the sensitivity of
+retrieval systems and suggest a new lens towards embedding model robustness.
+
+
+
+ comment: 13 pages, 11 figures, NeurIPS
+
+
+
+
+
+
+ ♻ ☆ Attention Heads of Large Language Models: A Survey
+
+
+ Since the advent of ChatGPT, Large Language Models (LLMs) have excelled in
+various tasks but remain as black-box systems. Understanding the reasoning
+bottlenecks of LLMs has become a critical challenge, as these limitations are
+deeply tied to their internal architecture. Among these, attention heads have
+emerged as a focal point for investigating the underlying mechanics of LLMs. In
+this survey, we aim to demystify the internal reasoning processes of LLMs by
+systematically exploring the roles and mechanisms of attention heads. We first
+introduce a novel four-stage framework inspired by the human thought process:
+Knowledge Recalling, In-Context Identification, Latent Reasoning, and
+Expression Preparation. Using this framework, we comprehensively review
+existing research to identify and categorize the functions of specific
+attention heads. Additionally, we analyze the experimental methodologies used
+to discover these special heads, dividing them into two categories:
+Modeling-Free and Modeling-Required methods. We further summarize relevant
+evaluation methods and benchmarks. Finally, we discuss the limitations of
+current research and propose several potential future directions.
+
+
+ Solving mathematical problems requires advanced reasoning abilities and
+presents notable challenges for large language models. Previous works usually
+synthesize data from proprietary models to augment existing datasets, followed
+by instruction tuning to achieve top-tier results. However, our analysis of
+these datasets reveals severe biases towards easy queries, with frequent
+failures to generate any correct response for the most challenging queries.
+Hypothesizing that difficult queries are crucial to learn complex reasoning, we
+propose Difficulty-Aware Rejection Tuning (DART), a method that allocates
+difficult queries more trials during the synthesis phase, enabling more
+extensive training on difficult samples. Utilizing DART, we have created new
+datasets for mathematical problem-solving that focus more on difficult queries
+and are substantially smaller than previous ones. Remarkably, our synthesis
+process solely relies on a 7B-sized open-weight model, without reliance on the
+commonly used proprietary GPT-4. We fine-tune various base models on our
+datasets ranging from 7B to 70B in size, resulting in a series of strong models
+called DART-MATH. In comprehensive in-domain and out-of-domain evaluation on 6
+mathematical benchmarks, DART-MATH outperforms vanilla rejection tuning
+significantly, being superior or comparable to prior art, despite using
+much smaller datasets and no proprietary models. Furthermore, our results
+position our synthetic datasets as the most effective and cost-efficient
+publicly available resources for advancing mathematical problem-solving.
+
+
+
+ comment: NeurIPS 2024. Data and model checkpoints are available at
+ https://github.com/hkust-nlp/dart-math
+
+
+
+
+
+
+ ♻ ☆ CLAPNQ: Cohesive Long-form Answers from Passages in Natural Questions
+ for RAG systems ACL
+
+
+
+
+
+
+
+
+ Sara Rosenthal, Avirup Sil, Radu Florian, Salim Roukos
+
+
+ Retrieval Augmented Generation (RAG) has become a popular application for
+large language models. It is preferable that successful RAG systems provide
+accurate answers that are supported by being grounded in a passage without any
+hallucinations. While considerable work is required for building a full RAG
+pipeline, being able to benchmark performance is also necessary. We present
+ClapNQ, a benchmark Long-form Question Answering dataset for the full RAG
+pipeline. ClapNQ includes long answers with grounded gold passages from Natural
+Questions (NQ) and a corpus to perform either retrieval, generation, or the
+full RAG pipeline. The ClapNQ answers are concise, 3x smaller than the full
+passage, and cohesive, meaning that the answer is composed fluently, often by
+integrating multiple pieces of the passage that are not contiguous. RAG models
+must adapt to these properties to be successful at ClapNQ. We present baseline
+experiments and analysis for ClapNQ that highlight areas where there is still
+significant room for improvement in grounded RAG. CLAPNQ is publicly available
+at https://github.com/primeqa/clapnq
+
+
+
+ comment: 26 pages, Accepted at TACL
+
+
+
+
+
+
+ ♻ ☆ Evidence Contextualization and Counterfactual Attribution for
+ Conversational QA over Heterogeneous Data with RAG Systems WSDM 2025
+
+
+
+
+
+
+
+
+ Rishiraj Saha Roy, Joel Schlotthauer, Chris Hinze, Andreas Foltyn, Luzian Hahn, Fabian Kuech
+
+
+ Retrieval Augmented Generation (RAG) works as a backbone for interacting with
+an enterprise's own data via Conversational Question Answering (ConvQA). In a
+RAG system, a retriever fetches passages from a collection in response to a
+question, which are then included in the prompt of a large language model (LLM)
+for generating a natural language (NL) answer. However, several RAG systems
+today suffer from two shortcomings: (i) retrieved passages usually contain
+their raw text and lack appropriate document context, negatively impacting both
+retrieval and answering quality; and (ii) attribution strategies that explain
+answer generation typically rely only on similarity between the answer and the
+retrieved passages, thereby only generating plausible but not causal
+explanations. In this work, we demonstrate RAGONITE, a RAG system that remedies
+the above concerns by: (i) contextualizing evidence with source metadata and
+surrounding text; and (ii) computing counterfactual attribution, a causal
+explanation approach where the contribution of an evidence to an answer is
+determined by the similarity of the original response to the answer obtained by
+removing that evidence. To evaluate our proposals, we release a new benchmark
+ConfQuestions: it has 300 hand-created conversational questions, each in
+English and German, coupled with ground truth URLs, completed questions, and
+answers from 215 public Confluence pages. These documents are typical of
+enterprise wiki spaces with heterogeneous elements. Experiments with RAGONITE
+on ConfQuestions show the viability of our ideas: contextualization improves
+RAG performance, and counterfactual explanations outperform standard
+attribution.
+
+
+
+ comment: Accepted at WSDM 2025, 8 pages
+
+
+
+
+
+
+ ♻ ☆ FocusLLM: Precise Understanding of Long Context by Dynamic Condensing
+
+
+ Empowering LLMs with the ability to precisely understand long contexts is
+crucial for many downstream applications. However, handling long contexts with
+conventional transformer architecture requires substantial training and
+inference resources. Existing context condensing methods cannot accurately
+understand the full context, as there is a considerable amount of information
+loss in the condensing process. To address these issues, we present FocusLLM, a
+framework designed to extend the fixed context length of any decoder-only LLM,
+allowing the model to focus on relevant information from very long sequences.
+FocusLLM first divides long text input into chunks based on the model's
+original context length. It then employs the dynamic condensing process to
+distill crucial information from each chunk. Ultimately, through the novel
+parallel decoding mechanism, FocusLLM can integrate the extracted information
+into its local context. FocusLLM stands out for great training efficiency and
+versatility: trained with an 8K input length and with much less training cost
+than previous methods, FocusLLM exhibits superior performance across downstream
+tasks and maintains strong language modeling ability when handling extensive
+long texts, even up to 400K tokens. Our code is available at
+https://github.com/leezythu/FocusLLM.
+
+
+
+
+
+
+
+ ♻ ☆ 2M-BELEBELE: Highly Multilingual Speech and American Sign Language
+ Comprehension Dataset
+
+
+
+
+
+
+
+
+ Marta R. Costa-jussà, Bokai Yu, Pierre Andrews, Belen Alastruey, Necati Cihan Camgoz, Joe Chuang, Jean Maillard, Christophe Ropers, Arina Turkantenko, Carleigh Wood
+
+
+ We introduce the first highly multilingual speech and American Sign Language
+(ASL) comprehension dataset by extending BELEBELE. Our dataset covers 74 spoken
+languages at the intersection of BELEBELE and FLEURS, and one sign language
+(ASL). We evaluate the 2M-BELEBELE dataset in both 5-shot and zero-shot settings.
+Across languages, speech comprehension accuracy is on average about 2-3% lower
+than reading comprehension accuracy.
+
+
+
+
+
+
+
+ ♻ ☆ CityBench: Evaluating the Capabilities of Large Language Models for
+ Urban Tasks
+
+
+
+
+
+
+
+
+ Jie Feng, Jun Zhang, Tianhui Liu, Xin Zhang, Tianjian Ouyang, Junbo Yan, Yuwei Du, Siqi Guo, Yong Li
+
+
+ Recently, large language models (LLMs) with extensive general knowledge and
+powerful reasoning abilities have seen rapid development and widespread
+application. A systematic and reliable evaluation of LLMs or vision-language
+models (VLMs) is a crucial step in applying and developing them for various
+fields. There have been some early explorations about the usability of LLMs for
+limited urban tasks, but a systematic and scalable evaluation benchmark is
+still lacking. The challenge in constructing a systematic evaluation benchmark
+for urban research lies in the diversity of urban data, the complexity of
+application scenarios and the highly dynamic nature of the urban environment.
+In this paper, we design CityBench, an interactive simulator-based evaluation
+platform, as the first systematic benchmark for evaluating the capabilities of
+LLMs for diverse tasks in urban research. First, we build CityData to integrate
+the diverse urban data and CitySimu to simulate fine-grained urban dynamics.
+Based on CityData and CitySimu, we design 8 representative urban tasks in 2
+categories of perception-understanding and decision-making as the CityBench.
+With extensive results from 30 well-known LLMs and VLMs in 13 cities around the
+world, we find that advanced LLMs and VLMs can achieve competitive performance
+in diverse urban tasks requiring commonsense and semantic understanding
+abilities, e.g., understanding the human dynamics and semantic inference of
+urban images. Meanwhile, they fail to solve the challenging urban tasks
+requiring professional knowledge and high-level reasoning abilities, e.g.,
+geospatial prediction and traffic control tasks. These observations provide
+valuable perspectives for utilizing and developing LLMs in the future. Code
+is openly accessible via https://github.com/tsinghua-fib-lab/CityBench.
+
+
+ Architectures such as Linformer and Mamba have recently emerged as
+competitive linear time replacements for transformers. However, corresponding
+large pretrained models are often unavailable, especially in non-text domains.
+To remedy this, we present a Cross-Architecture Layerwise Distillation (CALD)
+approach that jointly converts a transformer model to a linear time substitute
+and fine-tunes it to a target task. We also compare several means to guide the
+fine-tuning to optimally retain the desired inference capability from the
+original model. The methods differ in their use of the target model and the
+trajectory of the parameters. In a series of empirical studies on language
+processing, language modeling, and speech processing, we show that CALD can
+effectively recover the result of the original model, and that the guiding
+strategy contributes to the result. Some reasons for the variation are
+suggested.
+
+
+
+ comment: 17 pages, 5 figures
+
+
+
+
+
+
+ ♻ ☆ Large Language Model-Brained GUI Agents: A Survey
+
+
+ GUIs have long been central to human-computer interaction, providing an
+intuitive and visually-driven way to access and interact with digital systems.
+The advent of LLMs, particularly multimodal models, has ushered in a new era of
+GUI automation. They have demonstrated exceptional capabilities in natural
+language understanding, code generation, and visual processing. This has paved
+the way for a new generation of LLM-brained GUI agents capable of interpreting
+complex GUI elements and autonomously executing actions based on natural
+language instructions. These agents represent a paradigm shift, enabling users
+to perform intricate, multi-step tasks through simple conversational commands.
+Their applications span across web navigation, mobile app interactions, and
+desktop automation, offering a transformative user experience that
+revolutionizes how individuals interact with software. This emerging field is
+rapidly advancing, with significant progress in both research and industry.
+ To provide a structured understanding of this trend, this paper presents a
+comprehensive survey of LLM-brained GUI agents, exploring their historical
+evolution, core components, and advanced techniques. We address research
+questions such as existing GUI agent frameworks, the collection and utilization
+of data for training specialized GUI agents, the development of large action
+models tailored for GUI tasks, and the evaluation metrics and benchmarks
+necessary to assess their effectiveness. Additionally, we examine emerging
+applications powered by these agents. Through a detailed analysis, this survey
+identifies key research gaps and outlines a roadmap for future advancements in
+the field. By consolidating foundational knowledge and state-of-the-art
+developments, this work aims to guide both researchers and practitioners in
+overcoming challenges and unlocking the full potential of LLM-brained GUI
+agents.
+
+
+
+ comment: The collection of papers reviewed in this survey will be hosted and
+ regularly updated on the GitHub repository:
+ https://github.com/vyokky/LLM-Brained-GUI-Agents-Survey Additionally, a
+ searchable webpage is available at https://aka.ms/gui-agent for easier access
+ and exploration
+
+
+
+
+
+
+
+ Zehui Wu, Ziwei Gong, Lin Ai, Pengyuan Shi, Kaan Donbekci, Julia Hirschberg
+
+
+ Emotion recognition in speech is a challenging multimodal task that requires
+understanding both verbal content and vocal nuances. This paper introduces a
+novel approach to emotion detection using Large Language Models (LLMs), which
+have demonstrated exceptional capabilities in natural language understanding.
+To overcome the inherent limitation of LLMs in processing audio inputs, we
+propose SpeechCueLLM, a method that translates speech characteristics into
+natural language descriptions, allowing LLMs to perform multimodal emotion
+analysis via text prompts without any architectural changes. Our method is
+minimal yet impactful, outperforming baseline models that require structural
+modifications. We evaluate SpeechCueLLM on two datasets: IEMOCAP and MELD,
+showing significant improvements in emotion recognition accuracy, particularly
+for high-quality audio data. We also explore the effectiveness of various
+feature representations and fine-tuning strategies for different LLMs. Our
+experiments demonstrate that incorporating speech descriptions yields a more
+than 2% increase in the average weighted F1 score on IEMOCAP (from 70.111% to
+72.596%).
+
+
+
+
+
+
+
+ ♻ ☆ Connecting the Dots: LLMs can Infer and Verbalize Latent Structure from
+ Disparate Training Data NeurIPS 2024
+
+
+
+
+
+
+
+
+ Johannes Treutlein, Dami Choi, Jan Betley, Samuel Marks, Cem Anil, Roger Grosse, Owain Evans
+
+
+ One way to address safety risks from large language models (LLMs) is to
+censor dangerous knowledge from their training data. While this removes the
+explicit information, implicit information can remain scattered across various
+training documents. Could an LLM infer the censored knowledge by piecing
+together these implicit hints? As a step towards answering this question, we
+study inductive out-of-context reasoning (OOCR), a type of generalization in
+which LLMs infer latent information from evidence distributed across training
+documents and apply it to downstream tasks without in-context learning. Using a
+suite of five tasks, we demonstrate that frontier LLMs can perform inductive
+OOCR. In one experiment we finetune an LLM on a corpus consisting only of
+distances between an unknown city and other known cities. Remarkably, without
+in-context examples or Chain of Thought, the LLM can verbalize that the unknown
+city is Paris and use this fact to answer downstream questions. Further
+experiments show that LLMs trained only on individual coin flip outcomes can
+verbalize whether the coin is biased, and those trained only on pairs
+$(x,f(x))$ can articulate a definition of $f$ and compute inverses. While OOCR
+succeeds in a range of cases, we also show that it is unreliable, particularly
+for smaller LLMs learning complex structures. Overall, the ability of LLMs to
+"connect the dots" without explicit in-context learning poses a potential
+obstacle to monitoring and controlling the knowledge acquired by LLMs.
+
+
+ With the rapid advancement of Large Language Models (LLMs), significant
+safety concerns have emerged. Fundamentally, the safety of large language
+models is closely linked to the accuracy, comprehensiveness, and clarity of
+their understanding of safety knowledge, particularly in domains such as law,
+policy and ethics. This factuality ability is crucial in determining whether
+these models can be deployed and applied safely and compliantly within specific
+regions. To address these challenges and better evaluate the factuality ability
+of LLMs to answer short questions, we introduce the Chinese SafetyQA benchmark.
+Chinese SafetyQA has several properties (i.e., Chinese, Diverse, High-quality,
+Static, Easy-to-evaluate, Safety-related, Harmless). Based on Chinese SafetyQA,
+we perform a comprehensive evaluation on the factuality abilities of existing
+LLMs and analyze how these capabilities relate to LLM abilities, e.g., RAG
+ability and robustness against attacks.
+
+
+
+
+
+
+
+ ♻ ☆ CLEAR: Character Unlearning in Textual and Visual Modalities
+
+
+
+
+
+
+
+
+ Alexey Dontsov, Dmitrii Korzh, Alexey Zhavoronkin, Boris Mikheev, Denis Bobkov, Aibek Alanov, Oleg Y. Rogov, Ivan Oseledets, Elena Tutubalina
+
+
+ Machine Unlearning (MU) is critical for enhancing privacy and security in
+deep learning models, particularly in large multimodal language models (MLLMs),
+by removing specific private or hazardous information. While MU has made
+significant progress in textual and visual modalities, multimodal unlearning
+(MMU) remains significantly underexplored, partially due to the absence of a
+suitable open-source benchmark. To address this, we introduce CLEAR, a new
+benchmark designed to evaluate MMU methods. CLEAR contains 200 fictitious
+individuals and 3,700 images linked with corresponding question-answer pairs,
+enabling a thorough evaluation across modalities. We assess 10 MU methods,
+adapting them for MMU, and highlight new challenges specific to multimodal
+forgetting. The dataset is available at
+https://huggingface.co/datasets/therem/CLEAR
+
+
+
+
+
+
+
+ ♻ ☆ AutoLife: Automatic Life Journaling with Smartphones and LLMs
+
+
+ This paper introduces a novel mobile sensing application - life journaling -
+designed to generate semantic descriptions of users' daily lives. We present
+AutoLife, an automatic life journaling system based on commercial smartphones.
+AutoLife takes only low-cost sensor data (without photos or audio) from
+smartphones as input and can automatically generate comprehensive life journals for
+users. To achieve this, we first derive time, motion, and location contexts
+from multimodal sensor data, and harness the zero-shot capabilities of Large
+Language Models (LLMs), enriched with commonsense knowledge about human lives,
+to interpret diverse contexts and generate life journals. To manage the task
+complexity and long sensing duration, a multilayer framework is proposed, which
+decomposes tasks and seamlessly integrates LLMs with other techniques for life
+journaling. This study establishes a real-life dataset as a benchmark, and
+extensive experimental results demonstrate that AutoLife produces accurate and
+reliable life journals.
+
+
+ Accurately predicting pedestrian trajectories is crucial in applications such
+as autonomous driving or service robotics, to name a few. Deep generative
+models achieve top performance in this task, assuming enough labelled
+trajectories are available for training. To this end, large amounts of
+synthetically generated, labelled trajectories exist (e.g., generated by video
+games). However, such trajectories are not meant to represent pedestrian motion
+realistically and are ineffective at training a predictive model. We propose a
+method and an architecture to augment synthetic trajectories at training time
+using an adversarial approach. We show that trajectory augmentation at
+training time unleashes significant gains when a state-of-the-art generative
+model is evaluated over real-world trajectories.
+
+
+
+
+
+
+
+ ☆ LayerDropBack: A Universally Applicable Approach for Accelerating
+ Training of Deep Networks
+
+
+ Training very deep convolutional networks is challenging, requiring
+significant computational resources and time. Existing acceleration methods
+often depend on specific architectures or require network modifications. We
+introduce LayerDropBack (LDB), a simple yet effective method to accelerate
+training across a wide range of deep networks. LDB introduces randomness only
+in the backward pass, maintaining the integrity of the forward pass,
+guaranteeing that the same network is used during both training and inference.
+LDB can be seamlessly integrated into the training process of any model without
+altering its architecture, making it suitable for various network topologies.
+Our extensive experiments across multiple architectures (ViT, Swin Transformer,
+EfficientNet, DLA) and datasets (CIFAR-100, ImageNet) show significant training
+time reductions of 16.93% to 23.97%, while preserving or even enhancing model
+accuracy. Code is available at https://github.com/neiterman21/LDB.
+
+
+
+
+
+
+
+ ☆ Online Adaptation for Myographic Control of Natural Dexterous Hand and
+ Finger Movements
+
+
+
+
+
+
+
+
+ Joseph L. Betthauser, Rebecca Greene, Ananya Dhawan, John T. Krall, Christopher L. Hunt, Gyorgy Levay, Rahul R. Kaliki, Matthew S. Fifer, Siddhartha Sikdar, Nitish V. Thakor
+
+
+ One of the most elusive goals in myographic prosthesis control is the ability
+to reliably decode continuous positions simultaneously across multiple
+degrees-of-freedom. Goal: To demonstrate dexterous, natural, biomimetic finger
+and wrist control of the highly advanced robotic Modular Prosthetic Limb.
+Methods: We combine sequential temporal regression models and reinforcement
+learning using myographic signals to produce continuous, simultaneous
+predictions of 7 finger and wrist degrees-of-freedom for 9 non-amputee human
+subjects in a minimally-constrained freeform training process. Results: We
+demonstrate highly dexterous 7 DoF position-based regression for prosthesis
+control from EMG signals, with significantly lower error rates than traditional
+approaches (p < 0.001) and nearly zero prediction response time delay (p <
+0.001). The models' performance can be continuously improved at any time using our
+freeform reinforcement process. Significance: We have demonstrated the most
+dexterous, biomimetic, and natural prosthesis control performance ever obtained
+from the surface EMG signal. Our reinforcement approach allowed us to abandon
+standard training protocols and simply allow the subject to move in any desired
+way while our models adapt. Conclusions: This work redefines the
+state-of-the-art in myographic decoding in terms of the reliability,
+responsiveness, and movement complexity available from prosthesis control
+systems. The present-day emergence and convergence of advanced algorithmic
+methods, experiment protocols, dexterous robotic prostheses, and sensor
+modalities represents a unique opportunity to finally realize our ultimate goal
+of achieving fully restorative natural upper-limb function for amputees.
+
+
+
+ comment: Modified from Chapter 5 of J. L. Betthauser, "Robust Adaptive
+ Strategies for Myographic Prosthesis Movement Decoding," Doctoral
+ Dissertation, Dept. of Electrical and Computer Engr, Johns Hopkins
+ University, 2020
+
+
+
+
+
+
+ ☆ ICPR 2024 Competition on Domain Adaptation and GEneralization for
+ Character Classification (DAGECC) ICPR 2024
+
+
+
+
+
+
+
+
+ Sofia Marino, Jennifer Vandoni, Emanuel Aldea, Ichraq Lemghari, Sylvie Le Hégarat-Mascle, Frédéric Jurie
+
+
+ In this companion paper for the DAGECC (Domain Adaptation and GEneralization
+for Character Classification) competition organized within the frame of the
+ICPR 2024 conference, we present the general context of the tasks we proposed
+to the community, we introduce the data that were prepared for the competition
+and we provide a summary of the results along with a description of the top
+three winning entries. The competition was centered around domain adaptation
+and generalization, and our core aim is to foster interest and facilitate
+advancement on these topics by providing a high-quality, lightweight, real
+world dataset able to support fast prototyping and validation of novel ideas.
+
+
+
+ comment: Companion paper for the ICPR 2024 Competition on Domain Adaptation
+ and GEneralization for Character Classification (DAGECC)
+
+
+
+
+
+
+ ☆ Unsupervised learning of spatially varying regularization for
+ diffeomorphic image registration
+
+
+ Spatially varying regularization accommodates the deformation variations that
+may be necessary for different anatomical regions during deformable image
+registration. Historically, optimization-based registration models have
+harnessed spatially varying regularization to address anatomical subtleties.
+However, most modern deep learning-based models tend to gravitate towards
+spatially invariant regularization, wherein a homogenous regularization
+strength is applied across the entire image, potentially disregarding localized
+variations. In this paper, we propose a hierarchical probabilistic model that
+integrates a prior distribution on the deformation regularization strength,
+enabling the end-to-end learning of a spatially varying deformation regularizer
+directly from the data. The proposed method is straightforward to implement and
+easily integrates with various registration network architectures.
+Additionally, automatic tuning of hyperparameters is achieved through Bayesian
+optimization, allowing efficient identification of optimal hyperparameters for
+any given registration task. Comprehensive evaluations on publicly available
+datasets demonstrate that the proposed method significantly improves
+registration performance and enhances the interpretability of deep
+learning-based registration, all while maintaining smooth deformations.
+
+
+
+ comment: Code available at http://bit.ly/3BrXGxz
+
+
+
+
+
+
+ ☆ Improving Sickle Cell Disease Classification: A Fusion of Conventional
+ Classifiers, Segmented Images, and Convolutional Neural Networks
+
+
+
+
+
+
+
+
+ Victor Júnio Alcântara Cardoso, Rodrigo Moreira, João Fernando Mari, Larissa Ferreira Rodrigues Moreira
+
+
+ Sickle cell anemia, which is characterized by abnormal erythrocyte
+morphology, can be detected using microscopic images. Computational techniques
+in medicine enhance the diagnosis and treatment efficiency. However, many
+computational techniques, particularly those based on Convolutional Neural
+Networks (CNNs), require high resources and time for training, highlighting the
+research opportunities in methods with low computational overhead. In this
+paper, we propose a novel approach combining conventional classifiers,
+segmented images, and CNNs for the automated classification of sickle cell
+disease. We evaluated the impact of segmented images on classification,
+providing insight into deep learning integration. Our results demonstrate that
+using segmented images and CNN features with an SVM achieves an accuracy of
+96.80%. This finding is relevant for computationally efficient scenarios,
+paving the way for future research and advancements in medical-image analysis.
+
+
+
+ comment: 14 pages
+
+
+
+
+
+
+ ☆ A Multimodal Fusion Framework for Bridge Defect Detection with
+ Cross-Verification
+
+
+ This paper presents a pilot study introducing a multimodal fusion framework
+for the detection and analysis of bridge defects, integrating Non-Destructive
+Evaluation (NDE) techniques with advanced image processing to enable precise
+structural assessment. By combining data from Impact Echo (IE) and Ultrasonic
+Surface Waves (USW) methods, this preliminary investigation focuses on
+identifying defect-prone regions within concrete structures, emphasizing
+critical indicators such as delamination and debonding. Using geospatial
+analysis with alpha shapes, fusion of defect points, and unified lane
+boundaries, the proposed framework consolidates disparate data sources to
+enhance defect localization and facilitate the identification of overlapping
+defect regions. Cross-verification with adaptive image processing further
+validates detected defects by aligning their coordinates with visual data,
+utilizing advanced contour-based mapping and bounding box techniques for
+precise defect identification. The experimental results, with an F1 score of
+0.83, demonstrate the potential efficacy of the approach in improving defect
+localization, reducing false positives, and enhancing detection accuracy, which
+provides a foundation for future research and larger-scale validation. This
+preliminary exploration establishes the framework as a promising tool for
+efficient bridge health assessment, with implications for proactive structural
+monitoring and maintenance.
+
+
+ Neural surfaces (e.g., neural map encoding, deep implicits and neural
+radiance fields) have recently gained popularity because of their generic
+structure (e.g., multi-layer perceptron) and easy integration with modern
+learning-based setups. Traditionally, we have a rich toolbox of geometry
+processing algorithms designed for polygonal meshes to analyze and operate on
+surface geometry. In the absence of an analogous toolbox, neural
+representations are typically discretized and converted into a mesh, before
+applying any geometry processing algorithm. This is unsatisfactory and, as we
+demonstrate, unnecessary. In this work, we propose a spherical neural surface
+representation for genus-0 surfaces and demonstrate how to compute core
+geometric operators directly on this representation. Namely, we estimate
+surface normals and first and second fundamental forms of the surface, as well
+as compute surface gradient, surface divergence and Laplace-Beltrami operator
+on scalar/vector fields defined on the surface. Our representation is fully
+seamless, overcoming a key limitation of similar explicit representations such
+as Neural Surface Maps [Morreale et al. 2021]. These operators, in turn, enable
+geometry processing directly on the neural representations without any
+unnecessary meshing. We demonstrate illustrative applications in (neural)
+spectral analysis, heat flow and mean curvature flow, and evaluate robustness
+to isometric shape variations. We propose theoretical formulations and validate
+their numerical estimates, against analytical estimates, mesh-based baselines,
+and neural alternatives, where available. By systematically linking neural
+surface representations with classical geometry processing algorithms, we
+believe that this work can become a key ingredient in enabling neural geometry
+processing. Code will be released upon acceptance, accessible from the project
+webpage.
+
+
+
+ comment: 14 pages, 14 figures
+
+
+
+
+
+
+ ♻ ☆ ERX: A Fast Real-Time Anomaly Detection Algorithm for Hyperspectral Line
+ Scanning
+
+
+
+
+
+
+
+
+ Samuel Garske, Bradley Evans, Christopher Artlett, KC Wong
+
+
+ Detecting unexpected objects (anomalies) in real time has great potential for
+monitoring, managing, and protecting the environment. Hyperspectral line-scan
+cameras are a low-cost solution that enhances confidence in anomaly detection
+over RGB and multispectral imagery. However, existing line-scan algorithms are
+too slow when using small computers (e.g. those onboard a drone or small
+satellite), do not adapt to changing scenery, or lack robustness against
+geometric distortions. This paper introduces the Exponentially moving RX
+algorithm (ERX) to address these issues, and compares it with four existing
+RX-based anomaly detection methods for hyperspectral line scanning. Three large
+and more complex datasets are also introduced to better assess the practical
+challenges when using line-scan cameras (two hyperspectral and one
+multispectral). ERX was evaluated using a Jetson Xavier NX edge computing
+module (6-core CPU, 8GB RAM, 20W power draw), achieving the best combination of
+speed and detection performance. ERX was 9 times faster than the next-best
+algorithm on the dataset with the highest number of bands (108 bands), with an
+average speed of 561 lines per second on the Jetson. It achieved a 29.3% AUC
+improvement over the next-best algorithm on the most challenging dataset, while
+showing greater adaptability through consistently high AUC scores regardless of
+the camera's starting location. ERX performed robustly across all datasets,
+achieving an AUC of 0.941 on a drone-collected hyperspectral line scan dataset
+without geometric corrections (a 16.9% improvement over existing algorithms).
+This work enables future research on the detection of anomalous objects in real
+time, adaptive and automatic threshold selection, and real-time field tests.
+The datasets and the Python code are openly available at:
+https://github.com/WiseGamgee/HyperAD, promoting accessibility and future work.
+
+
+
+ comment: 17 pages, 13 figures, 4 tables, code and datasets accessible at
+ https://github.com/WiseGamgee/HyperAD
+
+
+
+
+
+
+ ♻ ☆ Label-Efficient Data Augmentation with Video Diffusion Models for
+ Guidewire Segmentation in Cardiac Fluoroscopy AAAI 2025
+
+
+
+
+
+
+
+
+ Shaoyan Pan, Yikang Liu, Lin Zhao, Eric Z. Chen, Xiao Chen, Terrence Chen, Shanhui Sun
+
+
+ The accurate segmentation of guidewires in interventional cardiac fluoroscopy
+videos is crucial for computer-aided navigation tasks. Although deep learning
+methods have demonstrated high accuracy and robustness in wire segmentation,
+they require substantial annotated datasets for generalizability, underscoring
+the need for extensive labeled data to enhance model performance. To address
+this challenge, we propose the Segmentation-guided Frame-consistency Video
+Diffusion Model (SF-VD) to generate large collections of labeled fluoroscopy
+videos, augmenting the training data for wire segmentation networks. SF-VD
+leverages videos with limited annotations by independently modeling scene
+distribution and motion distribution. It first samples the scene distribution
+by generating 2D fluoroscopy images with wires positioned according to a
+specified input mask, and then samples the motion distribution by progressively
+generating subsequent frames, ensuring frame-to-frame coherence through a
+frame-consistency strategy. A segmentation-guided mechanism further refines the
+process by adjusting wire contrast, ensuring a diverse range of visibility in
+the synthesized image. Evaluation on a fluoroscopy dataset confirms the
+superior quality of the generated videos and shows significant improvements in
+guidewire segmentation.
+
+
+
+ comment: AAAI 2025
+
+
+
+
+
+
+
+
+
+ Information Retrieval 17
+
+
+
+
+
+ ☆ Time-Probability Dependent Knowledge Extraction in IoT-enabled Smart
+ Building
+
+
+ Smart buildings incorporate various emerging Internet of Things (IoT)
+applications for comprehensive management of energy efficiency, human comfort,
+automation, and security. However, the development of a knowledge extraction
+framework is fundamental. Currently, there is a lack of a unified and practical
+framework for modeling heterogeneous sensor data within buildings. In this
+paper, we propose a practical inference framework for extracting
+status-to-event knowledge within smart buildings. Our proposal includes
+IoT-based API integration, ontology model design, and time probability
+dependent knowledge extraction methods. The Building Topology Ontology (BOT)
+was leveraged to construct spatial relations among sensors and spaces within
+the building. We utilized Apache Jena Fuseki's SPARQL server for storing and
+querying the RDF triple data. Two types of knowledge could be extracted:
+timestamp-based probability for abnormal event detection and time
+interval-based probability for conjunction of multiple events. We conducted
+experiments (over a 78-day period) in a real smart building environment. The
+data on light and elevator states were collected for evaluation. The
+evaluation revealed several inferred events, such as room occupancy, elevator
+trajectory tracking, and the conjunction of both events. The numerical values
+of detected event counts and probability demonstrate the potential for
+automatic control in the smart building.
+
+
+
+
+
+
+
+ ☆ WavePulse: Real-time Content Analytics of Radio Livestreams
+
+
+ Radio remains a pervasive medium for mass information dissemination, with
+AM/FM stations reaching more Americans than either smartphone-based social
+networking or live television. Increasingly, radio broadcasts are also streamed
+online and accessed over the Internet. We present WavePulse, a framework that
+records, documents, and analyzes radio content in real-time. While our
+framework is generally applicable, we showcase the efficacy of WavePulse in a
+collaborative project with a team of political scientists focusing on the 2024
+Presidential Elections. We use WavePulse to monitor livestreams of 396 news
+radio stations over a period of three months, processing close to 500,000 hours
+of audio streams. These streams were converted into time-stamped, diarized
+transcripts and analyzed to answer key political science questions at
+both the national and state levels. Our analysis revealed how local issues
+interacted with national trends, providing insights into information flow. Our
+results demonstrate WavePulse's efficacy in capturing and analyzing content
+from radio livestreams sourced from the Web. Code and dataset can be accessed
+at https://wave-pulse.io.
+
+
+
+ comment: 22 Pages: 10 main + 12 appendix, 24 figures. Access code and dataset
+ at https://wave-pulse.io
+
+ Leveraging Large Language Models (LLMs) to harness user-item interaction
+histories for item generation has emerged as a promising paradigm in generative
+recommendation. However, the limited context window of LLMs often restricts
+them to focusing on recent user interactions only, leading to the neglect of
+long-term interests involved in the longer histories. To address this
+challenge, we propose a novel Automatic Memory-Retrieval framework (AutoMR),
+which is capable of storing long-term interests in the memory and extracting
+relevant information from it for next-item generation within LLMs. Extensive
+experimental results on two real-world datasets demonstrate the effectiveness
+of our proposed AutoMR framework in utilizing long-term interests for
+generative recommendation.
+
+
+
+
+
+
+
+ ☆ Comparative Analysis of Document-Level Embedding Methods for Similarity
+ Scoring on Shakespeare Sonnets and Taylor Swift Lyrics
+
+
+ This study evaluates the performance of TF-IDF weighting, averaged Word2Vec
+embeddings, and BERT embeddings for document similarity scoring across two
+contrasting textual domains. By analysing cosine similarity scores, the
+methods' strengths and limitations are highlighted. The findings underscore
+TF-IDF's reliance on lexical overlap and Word2Vec's superior semantic
+generalisation, particularly in cross-domain comparisons. BERT demonstrates
+lower performance in challenging domains, likely due to insufficient
+domain-specific fine-tuning.
+
+
+
+ comment: 9 pages, 4 figures
+
+
+
+
+
+
+ ☆ CiteBART: Learning to Generate Citations for Local Citation
+ Recommendation
+
+
+ Citations are essential building blocks in scientific writing. The scientific
+community is longing for support in their generation. Citation generation
+involves two complementary subtasks: Determining the citation worthiness of a
+context and, if it's worth it, proposing the best candidate papers for the
+citation placeholder. The latter subtask is called local citation
+recommendation (LCR). This paper proposes CiteBART, a custom BART pre-training
+based on citation token masking to generate citations to achieve LCR. In the
+base scheme, we mask the citation token in the local citation context to make
+the citation prediction. In the global one, we concatenate the citing paper's
+title and abstract to the local citation context to learn to reconstruct the
+citation token. CiteBART outperforms state-of-the-art approaches on the
+citation recommendation benchmarks except for the smallest FullTextPeerRead
+dataset. The effect is significant in the larger benchmarks, e.g., Refseer and
+ArXiv. We present a qualitative analysis and an ablation study to provide
+insights into the workings of CiteBART. Our analyses confirm that its
+generative nature brings about a zero-shot capability.
+
+
+ Multi Scenario Recommendation (MSR) tasks, referring to building a unified
+model to enhance performance across all recommendation scenarios, have recently
+gained much attention. However, current research in MSR faces two significant
+challenges that hinder the field's development: the absence of uniform
+procedures for multi-scenario dataset processing, thus hindering fair
+comparisons, and most models being closed-source, which complicates
+comparisons with current SOTA models. Consequently, we introduce our benchmark,
+Scenario-Wise Rec, which comprises 6 public datasets and 12 benchmark
+models, along with a training and evaluation pipeline. Additionally, we
+validated the benchmark using an industrial advertising dataset, reinforcing
+its reliability and applicability in real-world scenarios. We aim for this
+benchmark to offer researchers valuable insights from prior work, enabling the
+development of novel models based on our benchmark and thereby fostering a
+collaborative research ecosystem in MSR. Our source code is also publicly
+available.
+
+
+
+
+
+
+
+ ☆ Efficient fine-tuning methodology of text embedding models for
+ information retrieval: contrastive learning penalty (clp)
+
+
+ Text embedding models play a crucial role in natural language processing,
+particularly in information retrieval, and their importance is further
+highlighted with the recent utilization of RAG (Retrieval-Augmented
+Generation). This study presents an efficient fine-tuning methodology
+encompassing data selection, loss function, and model architecture to enhance
+the information retrieval performance of pre-trained text embedding models. In
+particular, this study proposes a novel Contrastive Learning Penalty function
+that overcomes the limitations of existing Contrastive Learning. The proposed
+methodology achieves significant performance improvements over existing methods
+in document retrieval tasks. This study is expected to contribute to improving
+the performance of information retrieval systems through fine-tuning of text
+embedding models. The code for this study can be found at
+https://github.com/CreaLabs/Enhanced-BGE-M3-with-CLP-and-MoE, and the
+best-performing model can be found at https://huggingface.co/CreaLabs.
+
+
+
+
+
+
+
+ ☆ Popularity Estimation and New Bundle Generation using Content and
+ Context based Embeddings
+
+
+ Recommender systems create enormous value for businesses and their consumers.
+They increase revenue for businesses while improving the consumer experience by
+recommending relevant products amidst a huge product base. Product bundling is an
+exciting development in the field of product recommendations. It aims at
+generating new bundles and recommending exciting and relevant bundles to their
+consumers. Unlike traditional recommender systems that recommend single items
+to consumers, product bundling aims at targeting a bundle, or a set of items,
+to the consumers. While bundle recommendation has attracted significant
+research interest recently, extant literature on bundle generation is scarce.
+Moreover, metrics to identify whether a bundle is popular are not well
+studied. In this work, we aim to fill this gap by introducing new bundle
+popularity metrics based on sales, consumer experience and item diversity in a
+bundle. We use these metrics in the methodology proposed in this paper to
+generate new bundles for mobile games using content aware and context aware
+embeddings. We use the open-source Steam Games dataset for our analysis. Our
+experiments indicate that we can generate new bundles that can outperform the
+existing bundles on the popularity metrics by 32% - 44%. Our experiments are
+computationally efficient and the proposed methodology is generic and can be
+extended to other bundling problems, e.g., product bundling and music bundling.
+
+
+ With the increasing intelligence and autonomy of LLM agents, their potential
+applications in the legal domain are becoming increasingly apparent. However,
+existing general-domain benchmarks cannot fully capture the complexity and
+subtle nuances of real-world judicial cognition and decision-making. Therefore,
+we propose LegalAgentBench, a comprehensive benchmark specifically designed to
+evaluate LLM Agents in the Chinese legal domain. LegalAgentBench includes 17
+corpora from real-world legal scenarios and provides 37 tools for interacting
+with external knowledge. We designed a scalable task construction framework and
+carefully annotated 300 tasks. These tasks span various types, including
+multi-hop reasoning and writing, and range across different difficulty levels,
+effectively reflecting the complexity of real-world legal scenarios. Moreover,
+beyond evaluating final success, LegalAgentBench incorporates keyword analysis
+during intermediate processes to calculate progress rates, enabling more
+fine-grained evaluation. We evaluated eight popular LLMs, highlighting the
+strengths, limitations, and potential areas for improvement of existing models
+and methods. LegalAgentBench sets a new benchmark for the practical application
+of LLMs in the legal domain, with its code and data available at
+https://github.com/CSHaitao/LegalAgentBench.
+
+
+ The performance of Dense retrieval (DR) is significantly influenced by the
+quality of negative sampling. Traditional DR methods primarily depend on naive
+negative sampling techniques or on mining hard negatives through external
+retrievers and meticulously crafted strategies. However, naive negative sampling
+often fails to adequately capture the accurate boundaries between positive and
+negative samples, whereas existing hard negative sampling methods are prone to
+false negatives, resulting in performance degradation and training instability.
+Recent advancements in large language models (LLMs) offer an innovative
+solution to these challenges by generating contextually rich and diverse
+negative samples. In this work, we present a framework that harnesses LLMs to
+synthesize high-quality hard negative samples. We first devise a
+multi-attribute self-reflection prompting strategy to direct LLMs in
+hard negative sample generation. Then, we implement a hybrid sampling
+strategy that integrates these synthetic negatives with traditionally
+retrieved negatives, thereby stabilizing the training process and improving
+retrieval performance. Extensive experiments on five benchmark datasets
+demonstrate the efficacy of our approach, and code is also publicly available.
+
+
+
+
+
+
+
+ ☆ GraphHash: Graph Clustering Enables Parameter Efficiency in Recommender
+ Systems
+
+
+
+
+
+
+
+
+ Xinyi Wu, Donald Loveland, Runjin Chen, Yozen Liu, Xin Chen, Leonardo Neves, Ali Jadbabaie, Clark Mingxuan Ju, Neil Shah, Tong Zhao
+
+
+ Deep recommender systems rely heavily on large embedding tables to handle
+high-cardinality categorical features such as user/item identifiers, and face
+significant memory constraints at scale. To tackle this challenge, hashing
+techniques are often employed to map multiple entities to the same embedding
+and thus reduce the size of the embedding tables. Concurrently, graph-based
+collaborative signals have emerged as powerful tools in recommender systems,
+yet their potential for optimizing embedding table reduction remains
+unexplored. This paper introduces GraphHash, the first graph-based approach
+that leverages modularity-based bipartite graph clustering on user-item
+interaction graphs to reduce embedding table sizes. We demonstrate that the
+modularity objective has a theoretical connection to message-passing, which
+provides a foundation for our method. By employing fast clustering algorithms,
+GraphHash serves as a computationally efficient proxy for message-passing
+during preprocessing and a plug-and-play graph-based alternative to traditional
+ID hashing. Extensive experiments show that GraphHash substantially outperforms
+diverse hashing baselines on both retrieval and click-through-rate prediction
+tasks. In particular, GraphHash achieves on average a 101.52% improvement in
+recall when reducing the embedding table size by more than 75%, highlighting
+the value of graph-based collaborative information for model reduction.
+
+
+
+
+
+
+
+ ☆ Unity is Strength: Unifying Convolutional and Transformeral Features for
+ Better Person Re-Identification
+
+
+ Person Re-identification (ReID) aims to retrieve the specific person across
+non-overlapping cameras, which greatly helps intelligent transportation
+systems. As we all know, Convolutional Neural Networks (CNNs) and Transformers
+have the unique strengths to extract local and global features, respectively.
+Considering this fact, we focus on the mutual fusion between them to learn more
+comprehensive representations for persons. In particular, we utilize the
+complementary integration of deep features from different model structures. We
+propose a novel fusion framework called FusionReID to unify the strengths of
+CNNs and Transformers for image-based person ReID. More specifically, we first
+deploy a Dual-branch Feature Extraction (DFE) to extract features through CNNs
+and Transformers from a single image. Moreover, we design a novel
+Dual-attention Mutual Fusion (DMF) to achieve sufficient feature fusions. The
+DMF comprises Local Refinement Units (LRU) and Heterogenous Transmission
+Modules (HTM). LRU utilizes depth-separable convolutions to align deep features
+in channel dimensions and spatial sizes. HTM consists of a Shared Encoding Unit
+(SEU) and two Mutual Fusion Units (MFU). Through the continuous stacking of
+HTM, deep features after LRU are repeatedly utilized to generate more
+discriminative features. Extensive experiments on three public ReID benchmarks
+demonstrate that our method attains performance superior to most
+state-of-the-art methods. The source code is available at
+https://github.com/924973292/FusionReID.
+
+
+
+ comment: Accepted by Trans. on ITS
+
+
+
+
+
+
+ ♻ ☆ Quantifying Positional Biases in Text Embedding Models NeurIPS
+
+
+ Embedding models are crucial for tasks in Information Retrieval (IR) and
+semantic similarity measurement, yet their handling of longer texts and
+associated positional biases remains underexplored. In this study, we
+investigate the impact of content position and input size on text embeddings.
+Our experiments reveal that embedding models, irrespective of their positional
+encoding mechanisms, disproportionately prioritize the beginning of an input.
+Ablation studies demonstrate that insertion of irrelevant text or removal at
+the start of a document reduces cosine similarity between altered and original
+embeddings by up to 12.3% more than ablations at the end. Regression analysis
+further confirms this bias, with sentence importance declining as position
+moves further from the start, even with content-agnosticity. We
+hypothesize that this effect arises from pre-processing strategies and chosen
+positional encoding techniques. These findings quantify the sensitivity of
+retrieval systems and suggest a new lens towards embedding model robustness.
+
+
+
+ comment: 13 pages, 11 figures, NeurIPS
+
+
+
+
+
+
+ ♻ ☆ Evidence Contextualization and Counterfactual Attribution for
+ Conversational QA over Heterogeneous Data with RAG Systems WSDM 2025
+
+
+
+
+
+
+
+
+ Rishiraj Saha Roy, Joel Schlotthauer, Chris Hinze, Andreas Foltyn, Luzian Hahn, Fabian Kuech
+
+
+ Retrieval Augmented Generation (RAG) works as a backbone for interacting with
+an enterprise's own data via Conversational Question Answering (ConvQA). In a
+RAG system, a retriever fetches passages from a collection in response to a
+question, which are then included in the prompt of a large language model (LLM)
+for generating a natural language (NL) answer. However, several RAG systems
+today suffer from two shortcomings: (i) retrieved passages usually contain
+their raw text and lack appropriate document context, negatively impacting both
+retrieval and answering quality; and (ii) attribution strategies that explain
+answer generation typically rely only on similarity between the answer and the
+retrieved passages, thereby only generating plausible but not causal
+explanations. In this work, we demonstrate RAGONITE, a RAG system that remedies
+the above concerns by: (i) contextualizing evidence with source metadata and
+surrounding text; and (ii) computing counterfactual attribution, a causal
+explanation approach where the contribution of an evidence to an answer is
+determined by the similarity of the original response to the answer obtained by
+removing that evidence. To evaluate our proposals, we release a new benchmark
+ConfQuestions: it has 300 hand-created conversational questions, each in
+English and German, coupled with ground truth URLs, completed questions, and
+answers from 215 public Confluence pages. These documents are typical of
+enterprise wiki spaces with heterogeneous elements. Experiments with RAGONITE
+on ConfQuestions show the viability of our ideas: contextualization improves
+RAG performance, and counterfactual explanations outperform standard
+attribution.
+
+
+
+ comment: Accepted at WSDM 2025, 8 pages
+
+
+
+
+
+
+ ♻ ☆ Your Causal Self-Attentive Recommender Hosts a Lonely Neighborhood WSDM'25
+
+
+ In the context of sequential recommendation, a pivotal issue pertains to the
+comparative analysis between bi-directional/auto-encoding (AE) and
+uni-directional/auto-regressive (AR) attention mechanisms, where the
+conclusions regarding architectural and performance superiority remain
+inconclusive. Previous efforts in such comparisons primarily involve
+summarizing existing works to identify a consensus or conducting ablation
+studies on peripheral modeling techniques, such as choices of loss functions.
+However, far fewer efforts have been made in (1) theoretical and (2) extensive
+empirical analysis of the self-attention module, the very pivotal structure on
+which performance and designing insights should be anchored. In this work, we
+first provide a comprehensive theoretical analysis of the AE/AR attention matrix in
+terms of (1) sparse local inductive bias, a.k.a. neighborhood effects, and
+(2) low rank approximation. Analytical metrics reveal that the AR attention
+exhibits sparse neighborhood effects suitable for generally sparse
+recommendation scenarios. Secondly, to support our theoretical analysis, we
+conduct extensive empirical experiments on comparing vanilla and variant AE/AR
+attention on five popular benchmarks with AR performing better overall. Results
+based on adaptive tuning, modularized design and Huggingface are reported.
+Lastly, we shed light on future design choices for performant self-attentive
+recommenders. We make our code and data available at
+https://github.com/yueqirex/Self-Attention-Direction-Check.
+
+
+
+ comment: Accepted to WSDM'25. Updates from the previous version: Added
+ theoretical attention matrix analysis
+
+
+
+
+
+
+ ♻ ☆ UniGLM: Training One Unified Language Model for Text-Attributed Graph
+ Embedding
+
+
+
+
+
+
+
+
+ Yi Fang, Dongzhe Fan, Sirui Ding, Ninghao Liu, Qiaoyu Tan
+
+
+ Representation learning on text-attributed graphs (TAGs), where nodes are
+represented by textual descriptions, is crucial for textual and relational
+knowledge systems and recommendation systems. Currently, state-of-the-art
+embedding methods for TAGs primarily focus on fine-tuning language models
+(e.g., BERT) using structure-aware training signals. While effective, these
+methods are tailored to individual TAGs and cannot generalize across various
+graph scenarios. Given the shared textual space, leveraging multiple TAGs for
+joint fine-tuning, aligning text and graph structure from different aspects,
+would be more beneficial. Motivated by this, we introduce a novel Unified Graph
+Language Model (UniGLM) framework, the first graph embedding model that
+generalizes well to both in-domain and cross-domain TAGs. Specifically, UniGLM
+is trained over multiple TAGs with different domains and scales using
+self-supervised contrastive learning. UniGLM includes an adaptive positive
+sample selection technique for identifying structurally similar nodes and a
+lazy contrastive module that is devised to accelerate training by minimizing
+repetitive encoding calculations. Extensive empirical results across 9
+benchmark TAGs demonstrate UniGLM's efficacy against leading embedding
+baselines in terms of generalization (various downstream tasks and backbones)
+and transfer learning (in and out of domain scenarios). The code is available
+at https://github.com/NYUSHCS/UniGLM.
+
+
+ Recommender systems are widely used in various real-world applications, but
+they often encounter the persistent challenge of the user cold-start problem.
+Cross-domain recommendation (CDR), which leverages user interactions from one
+domain to improve prediction performance in another, has emerged as a promising
+solution. However, users with similar preferences in the source domain may
+exhibit different interests in the target domain. Therefore, directly
+transferring embeddings may introduce irrelevant source-domain collaborative
+information. In this paper, we propose a novel graph-based disentangled
+contrastive learning framework to capture fine-grained user intent and filter
+out irrelevant collaborative information, thereby avoiding negative transfer.
+Specifically, for each domain, we use a multi-channel graph encoder to capture
+diverse user intents. We then construct the affinity graph in the embedding
+space and perform multi-step random walks to capture high-order user similarity
+relationships. Treating one domain as the target, we propose a disentangled
+intent-wise contrastive learning approach, guided by user similarity, to refine
+the bridging of user intents across domains. Extensive experiments on four
+benchmark CDR datasets demonstrate that DisCo consistently outperforms existing
+state-of-the-art baselines, thereby validating the effectiveness of both DisCo
+and its components.
+
+
+
+ comment: Accepted at AAAI 2025
+
+
+
+
+
+
+
+
+
+ Multimedia 12
+
+
+
+
+
+ ☆ A Multimodal Emotion Recognition System: Integrating Facial Expressions,
+ Body Movement, Speech, and Spoken Language
+
+
+ Traditional psychological evaluations rely heavily on human observation and
+interpretation, which are prone to subjectivity, bias, fatigue, and
+inconsistency. To address these limitations, this work presents a multimodal
+emotion recognition system that provides a standardised, objective, and
+data-driven tool to support evaluators, such as psychologists, psychiatrists,
+and clinicians. The system integrates recognition of facial expressions,
+speech, spoken language, and body movement analysis to capture subtle emotional
+cues that are often overlooked in human evaluations. By combining these
+modalities, the system provides more robust and comprehensive emotional state
+assessment, reducing the risk of mis- and overdiagnosis. Preliminary testing in
+a simulated real-world condition demonstrates the system's potential to provide
+reliable emotional insights to improve the diagnostic accuracy. This work
+highlights the promise of automated multimodal analysis as a valuable
+complement to traditional psychological evaluation practices, with applications
+in clinical and therapeutic settings.
+
+
+
+ comment: 10 pages, 6 figures, 3 tables
+
+
+
+
+
+
+ ☆ VERSA: A Versatile Evaluation Toolkit for Speech, Audio, and Music
+
+
+ In this work, we introduce VERSA, a unified and standardized evaluation
+toolkit designed for various speech, audio, and music signals. The toolkit
+features a Pythonic interface with flexible configuration and dependency
+control, making it user-friendly and efficient. With full installation, VERSA
+offers 63 metrics with 711 metric variations based on different configurations.
+These metrics encompass evaluations utilizing diverse external resources,
+including matching and non-matching reference audio, text transcriptions, and
+text captions. As a lightweight yet comprehensive toolkit, VERSA is versatile
+enough to support the evaluation of a wide range of downstream scenarios. To
+demonstrate its capabilities, this work highlights example use cases for VERSA,
+including audio coding, speech synthesis, speech enhancement, singing
+synthesis, and music generation. The toolkit is available at
+https://github.com/shinjiwlab/versa.
+
+
+
+
+
+
+
+ ☆ ANID: How Far Are We? Evaluating the Discrepancies Between
+ AI-synthesized Images and Natural Images through Multimodal Guidance
+
+
+ In the rapidly evolving field of Artificial Intelligence Generated Content
+(AIGC), one of the key challenges is distinguishing AI-synthesized images from
+natural images. Despite the remarkable capabilities of advanced AI generative
+models in producing visually compelling images, significant discrepancies
+remain when these images are compared to natural ones. To systematically
+investigate and quantify these discrepancies, we introduce an AI-Natural Image
+Discrepancy Evaluation benchmark aimed at addressing the critical question:
+how far are AI-generated images (AIGIs) from truly realistic images?
+We have constructed a large-scale multimodal dataset, the Distinguishing
+Natural and AI-generated Images (DNAI) dataset, which includes over 440,000
+AIGI samples generated by 8 representative models using both unimodal and
+multimodal prompts, such as Text-to-Image (T2I), Image-to-Image (I2I), and Text
+vs. Image-to-Image (TI2I). Our fine-grained assessment framework
+provides a comprehensive evaluation of the DNAI dataset across five key
+dimensions: naive visual feature quality, semantic alignment in multimodal
+generation, aesthetic appeal, downstream task applicability, and coordinated
+human validation. Extensive evaluation results highlight significant
+discrepancies across these dimensions, underscoring the necessity of aligning
+quantitative metrics with human judgment to achieve a holistic understanding of
+AI-generated image quality. Code is available at
+https://github.com/ryliu68/ANID.
+
+
+
+
+
+
+
+ ☆ Predicting Satisfied User and Machine Ratio for Compressed Images: A
+ Unified Approach
+
+
+ Nowadays, high-quality images are pursued by both humans for better viewing
+experience and by machines for more accurate visual analysis. However, images
+are usually compressed before being consumed, decreasing their quality. It is
+meaningful to predict the perceptual quality of compressed images for both
+humans and machines, which guides the optimization for compression. In this
+paper, we propose a unified approach to address this. Specifically, we create a
+deep learning-based model to predict Satisfied User Ratio (SUR) and Satisfied
+Machine Ratio (SMR) of compressed images simultaneously. We first pre-train a
+feature extractor network on a large-scale SMR-annotated dataset with human
+perception-related quality labels generated by diverse image quality models,
+which simulates the acquisition of SUR labels. Then, we propose an
+MLP-Mixer-based network to predict SUR and SMR by leveraging and fusing the
+extracted multi-layer features. We introduce a Difference Feature Residual
+Learning (DFRL) module to learn more discriminative difference features. We
+further use a Multi-Head Attention Aggregation and Pooling (MHAAP) layer to
+aggregate difference features and reduce their redundancy. Experimental results
+indicate that the proposed model significantly outperforms state-of-the-art SUR
+and SMR prediction methods. Moreover, our joint learning scheme of human and
+machine perceptual quality prediction tasks is effective at improving the
+performance of both.
+
+
+
+
+
+
+
+ ☆ VidCtx: Context-aware Video Question Answering with Image Models
+
+
+
+
+
+
+
+
+ Andreas Goulas, Vasileios Mezaris, Ioannis Patras
+
+
+ To address computational and memory limitations of Large Multimodal Models in
+the Video Question-Answering task, several recent methods extract textual
+representations per frame (e.g., by captioning) and feed them to a Large
+Language Model (LLM) that processes them to produce the final response.
+However, in this way, the LLM does not have access to visual information and
+often has to process repetitive textual descriptions of nearby frames. To
+address those shortcomings, in this paper, we introduce VidCtx, a novel
+training-free VideoQA framework which integrates both modalities, i.e. both
+visual information from input frames and textual descriptions of other frames
+that give the appropriate context. More specifically, in the proposed framework
+a pre-trained Large Multimodal Model (LMM) is prompted to extract, at regular
+intervals, question-aware textual descriptions (captions) of video frames.
+These are used as context when the same LMM is prompted to answer the
+question at hand, given as input a) a certain frame, b) the question and c) the
+context/caption of an appropriate frame. To avoid redundant information, we
+choose as context the descriptions of distant frames. Finally, a simple yet
+effective max pooling mechanism is used to aggregate the frame-level decisions.
+This methodology enables the model to focus on the relevant segments of the
+video and scale to a high number of frames. Experiments show that VidCtx
+achieves competitive performance among approaches that rely on open models on
+three public Video QA benchmarks, NExT-QA, IntentQA and STAR.
+
+
+
+ comment: Submitted for publication
+
+
+
+
+
+
+ ☆ Modality-Aware Shot Relating and Comparing for Video Scene Detection
+
+
+
+
+
+
+
+
+ Jiawei Tan, Hongxing Wang, Kang Dang, Jiaxin Li, Zhilong Ou
+
+
+ Video scene detection involves assessing whether each shot and its
+surroundings belong to the same scene. Achieving this requires meticulously
+correlating multi-modal cues, e.g., visual entity and place modalities,
+among shots and comparing semantic changes around each shot. However, most
+methods treat multi-modal semantics equally and do not examine contextual
+differences between the two sides of a shot, leading to sub-optimal detection
+performance. In this paper, we propose the Modality-Aware
+Shot Relating and Comparing approach (MASRC), which
+enables relating shots per their own characteristics of visual entity and place
+modalities, as well as comparing multi-shot similarities so that scene changes
+are explicitly encoded. Specifically, to fully harness the potential of visual
+entity and place modalities in modeling shot relations, we mine long-term shot
+correlations from entity semantics while simultaneously revealing short-term
+shot correlations from place semantics. In this way, we can learn distinctive
+shot features that consolidate coherence within scenes and amplify
+distinguishability across scenes. Once equipped with distinctive shot features,
+we further encode the relations between preceding and succeeding shots of each
+target shot by similarity convolution, aiding in the identification of scene
+ending shots. We validate the broad applicability of the proposed components in
+MASRC. Extensive experimental results on public benchmark datasets demonstrate
+that the proposed MASRC significantly advances video scene detection.
+
+
+
+
+
+
+
+ ♻ ☆ SwinGS: Sliding Window Gaussian Splatting for Volumetric Video Streaming
+ with Arbitrary Length
+
+
+ Recent advances in 3D Gaussian Splatting (3DGS) have garnered significant
+attention in computer vision and computer graphics due to its high rendering
+speed and remarkable quality. While extant research has endeavored to extend
+the application of 3DGS from static to dynamic scenes, such efforts have been
+consistently impeded by excessive model sizes, constraints on video duration,
+and content deviation. These limitations significantly compromise the
+streamability of dynamic 3D Gaussian models, thereby restricting their utility
+in downstream applications, including volumetric video, autonomous vehicles, and
+immersive technologies such as virtual, augmented, and mixed reality.
+ This paper introduces SwinGS, a novel framework for training, delivering, and
+rendering volumetric video in a real-time streaming fashion. To address the
+aforementioned challenges and enhance streamability, SwinGS integrates
+spacetime Gaussian with Markov Chain Monte Carlo (MCMC) to adapt the model to
+fit various 3D scenes across frames, while employing a sliding window that
+captures Gaussian snapshots for each frame in an accumulative way. We implement
+a prototype of SwinGS and demonstrate its streamability across various datasets
+and scenes. Additionally, we develop an interactive WebGL viewer enabling
+real-time volumetric video playback on most devices with modern browsers,
+including smartphones and tablets. Experimental results show that SwinGS
+reduces transmission costs by 83.6% compared to previous work with negligible
+compromise in PSNR. Moreover, SwinGS easily scales to long video sequences
+without compromising quality.
+
+
+
+
+
+
+
+ ♻ ☆ Reviewing Intelligent Cinematography: AI research for camera-based video
+ production
+
+
+
+
+
+
+
+
+ Adrian Azzarelli, Nantheera Anantrasirichai, David R Bull
+
+
+ This paper offers the first comprehensive review of artificial intelligence
+(AI) research in the context of real camera content acquisition for
+entertainment purposes and is aimed at both researchers and cinematographers.
+Addressing the lack of review papers in the field of intelligent
+cinematography (IC) and the breadth of related computer vision research, we
+present a holistic view of the IC landscape while providing technical insight,
+important for experts across disciplines. We provide technical background on
+generative AI, object detection, automated camera calibration and 3-D content
+acquisition, with references to assist non-technical readers. The application
+sections categorize work in terms of four production types: General Production,
+Virtual Production, Live Production and Aerial Production. Within each
+application section, we (1) sub-classify work according to research topic and
+(2) describe the trends and challenges relevant to each type of production. In
+the final chapter, we address the greater scope of IC research and summarize
+the significant potential of this area to influence the creative industries
+sector. We suggest that work relating to virtual production has the greatest
+potential to impact other mediums of production, driven by the growing interest
+in LED volumes/stages for in-camera virtual effects (ICVFX) and automated 3-D
+capture for virtual modeling of real world scenes and actors. We also address
+ethical and legal concerns regarding the use of creative AI that impact on
+artists, actors, technologists and the general public.
+
+
+
+ comment: For researchers and cinematographers. 43 pages including Table of
+ Contents, List of Figures and Tables. We obtained permission to use Figures 5
+ and 11. All other Figures have been drawn by us
+
+
+
+
+
+
+ ♻ ☆ One Framework to Rule Them All: Unifying Multimodal Tasks with LLM
+ Neural-Tuning
+
+
+ Large-scale models have exhibited remarkable capabilities across diverse
+domains, including automated medical services and intelligent customer support.
+However, as most large models are trained on single-modality corpora, enabling
+them to effectively process and understand multimodal signals remains a
+significant challenge. Current research often focuses on designing
+task-specific or scenario-specific tuning strategies, which limits the
+scalability and versatility. To address this limitation, we propose a unified
+framework that concurrently handles multiple tasks and modalities. In this
+framework, all modalities and tasks are represented as unified tokens and
+trained using a single, consistent approach. To enable efficient multitask
+processing, we introduce a novel tuning strategy termed neural tuning, inspired
+by the concept of sparse distributed representation in the human brain, where
+only specific subsets of neurons are activated for each task. Furthermore, to
+advance research in multimodal and multitask learning, we present a new
+benchmark, MMUD, which includes samples annotated with multiple task labels
+spanning reasoning segmentation, referring segmentation, image captioning, and
+text-to-image generation. By applying neural tuning to pretrained large models
+on the MMUD benchmark, we demonstrate the ability to handle multiple tasks
+simultaneously in a streamlined and efficient manner. All models, code, and
+datasets will be released publicly upon publication, fostering further research
+and innovation in this field.
+
+
+
+
+
+
+
+ ♻ ☆ Content Adaptive Front End For Audio Classification
+
+
+ We propose a learnable content adaptive front end for audio signal
+processing. Before the modern advent of deep learning, we used fixed
+representation non-learnable front-ends like spectrogram or mel-spectrogram
+with/without neural architectures. With convolutional architectures supporting
+various applications such as ASR and acoustic scene understanding, a shift to
+learnable front ends occurred in which both the type of basis functions and the
+weights were learned from scratch and optimized for the particular task of
+interest. With the shift to transformer-based architectures with no
+convolutional blocks present, a linear layer projects small waveform patches
+onto a small latent dimension before feeding them to a transformer
+architecture. In this work, we propose a way of computing a content-adaptive
+learnable time-frequency representation. We pass each audio signal through a
+bank of convolutional filters, each giving a fixed-dimensional vector. It is
+akin to learning a bank of finite impulse-response filterbanks and passing the
+input signal through the optimum filter bank depending on the content of the
+input signal. A content-adaptive learnable time-frequency representation may be
+more broadly applicable, beyond the experiments in this paper.
+
+
+
+ comment: 5 pages, 4 figures. 2023 IEEE International Conference on Acoustics,
+ Speech, and Signal Processing, Rhodes, Greece; Minor Edits
+
+ In this paper, we present our methods and results for the Video-To-Text (VTT)
+task at TRECVid 2024, exploring the capabilities of Vision-Language Models
+(VLMs) like LLaVA and LLaVA-NeXT-Video in generating natural language
+descriptions for video content. We investigate the impact of fine-tuning VLMs
+on VTT datasets to enhance description accuracy, contextual relevance, and
+linguistic consistency. Our analysis reveals that fine-tuning substantially
+improves the model's ability to produce more detailed and domain-aligned text,
+bridging the gap between generic VLM tasks and the specialized needs of VTT.
+Experimental results demonstrate that our fine-tuned model outperforms baseline
+VLMs across various evaluation metrics, underscoring the importance of
+domain-specific tuning for complex VTT tasks.
+
+
+ Content creators often use music to enhance their videos, from soundtracks in
+movies to background music in video blogs and social media content. However,
+identifying the best music for a video can be a difficult and time-consuming
+task. To address this challenge, we propose a novel framework for automatically
+retrieving a matching music clip for a given video, and vice versa. Our
+approach leverages annotated music labels, as well as the inherent artistic
+correspondence between visual and music elements. Distinct from previous
+cross-modal music retrieval works, our method combines both self-supervised and
+supervised training objectives. We use self-supervised and label-supervised
+contrastive learning to train a joint embedding space between music and video.
+We show the effectiveness of our approach by using music genre labels for the
+supervised training component, and our framework can be generalized to other
+music annotations (e.g., emotion, instrument, etc.). Furthermore, our method
+enables fine-grained control over how much the retrieval process focuses on
+self-supervised vs. label information at inference time. We evaluate the
+learned embeddings through a variety of video-to-music and music-to-video
+retrieval tasks. Our experiments show that the proposed approach successfully
+combines self-supervised and supervised objectives and is effective for
+controllable music-video retrieval.
+
+
+
+ comment: Accepted at ICASSP 2025
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Information Retrieval 12
+
+
+
+
+
+ ☆ Enhancing Item Tokenization for Generative Recommendation through
+ Self-Improvement
+
+
+ Generative recommendation systems, driven by large language models (LLMs),
+present an innovative approach to predicting user preferences by modeling items
+as token sequences and generating recommendations in a generative manner. A
+critical challenge in this approach is the effective tokenization of items,
+ensuring that they are represented in a form compatible with LLMs. Current item
+tokenization methods include using text descriptions, numerical strings, or
+sequences of discrete tokens. While text-based representations integrate
+seamlessly with LLM tokenization, they are often too lengthy, leading to
+inefficiencies and complicating accurate generation. Numerical strings, while
+concise, lack semantic depth and fail to capture meaningful item relationships.
+Tokenizing items as sequences of newly defined tokens has gained traction, but
+it often requires external models or algorithms for token assignment. These
+external processes may not align with the LLM's internal pretrained
+tokenization schema, leading to inconsistencies and reduced model performance.
+To address these limitations, we propose a self-improving item tokenization
+method that allows the LLM to refine its own item tokenizations during the training
+process. Our approach starts with item tokenizations generated by any external
+model and periodically adjusts these tokenizations based on the LLM's learned
+patterns. This alignment process ensures consistency between the tokenization
+and the LLM's internal understanding of the items, leading to more accurate
+recommendations. Furthermore, our method is simple to implement and can be
+integrated as a plug-and-play enhancement into existing generative
+recommendation systems. Experimental results on multiple datasets and using
+various initial tokenization strategies demonstrate the effectiveness of our
+method, with an average improvement of 8% in recommendation performance.
+
+
+
+
+
+
+
+ ☆ LLM-based relevance assessment still can't replace human relevance
+ assessment
+
+
+ The use of large language models (LLMs) for relevance assessment in
+information retrieval has gained significant attention, with recent studies
+suggesting that LLM-based judgments provide comparable evaluations to human
+judgments. Notably, based on TREC 2024 data, Upadhyay et al. make a bold claim
+that LLM-based relevance assessments, such as those generated by the UMBRELA
+system, can fully replace traditional human relevance assessments in TREC-style
+evaluations. This paper critically examines this claim, highlighting practical
+and theoretical limitations that undermine the validity of this conclusion.
+First, we question whether the evidence provided by Upadhyay et al. really
+supports their claim, particularly if a test collection is used as a benchmark
+for future improvements. Second, through a submission deliberately intended to
+do so, we demonstrate the ease with which automatic evaluation metrics can be
+subverted, showing that systems designed to exploit these evaluations can
+achieve artificially high scores. Theoretical challenges -- such as the
+inherent narcissism of LLMs, the risk of overfitting to LLM-based metrics, and
+the potential degradation of future LLM performance -- must be addressed before
+LLM-based relevance assessments can be considered a viable replacement for
+human judgments.
+
+
+
+
+
+
+
+ ☆ Iterative NLP Query Refinement for Enhancing Domain-Specific Information
+ Retrieval: A Case Study in Career Services
+
+
+ Retrieving semantically relevant documents in niche domains poses significant
+challenges for traditional TF-IDF-based systems, often resulting in low
+similarity scores and suboptimal retrieval performance. This paper addresses
+these challenges by introducing an iterative and semi-automated query
+refinement methodology tailored to Humber College's career services webpages.
+Initially, generic queries related to interview preparation yield low
+top-document similarities (approximately 0.2--0.3). To enhance retrieval
+effectiveness, we implement a two-fold approach: first, domain-aware query
+refinement by incorporating specialized terms such as
+resources-online-learning, student-online-services, and career-advising;
+second, the integration of structured educational descriptors like "online
+resume and interview improvement tools." Additionally, we automate the
+extraction of domain-specific keywords from top-ranked documents to suggest
+relevant terms for query expansion. Through experiments conducted on five
+baseline queries, our semi-automated iterative refinement process elevates the
+average top similarity score from approximately 0.18 to 0.42, marking a
+substantial improvement in retrieval performance. The implementation details,
+including reproducible code and experimental setups, are made available in our
+GitHub repositories https://github.com/Elipei88/HumberChatbotBackend and
+https://github.com/Nisarg851/HumberChatbot. We also discuss the
+limitations of our approach and propose future directions, including the
+integration of advanced neural retrieval models.
+
+
+
+ comment: To be submitted to CoLM 2025
+
+
+
+
+
+
+ ☆ LLM-Powered User Simulator for Recommender System
+
+
+ User simulators can rapidly generate a large volume of timely user behavior
+data, providing a testing platform for reinforcement learning-based recommender
+systems, thus accelerating their iteration and optimization. However, prevalent
+user simulators generally suffer from significant limitations, including the
+opacity of user preference modeling and the incapability of evaluating
+simulation accuracy. In this paper, we introduce an LLM-powered user simulator
+to simulate user engagement with items in an explicit manner, thereby enhancing
+the efficiency and effectiveness of reinforcement learning-based recommender
+systems training. Specifically, we identify the explicit logic of user
+preferences, leverage LLMs to analyze item characteristics and distill user
+sentiments, and design a logical model to imitate real human engagement. By
+integrating a statistical model, we further enhance the reliability of the
+simulation, proposing an ensemble model that synergizes logical and statistical
+insights for user interaction simulations. Capitalizing on the extensive
+knowledge and semantic generation capabilities of LLMs, our user simulator
+faithfully emulates user behaviors and preferences, yielding high-fidelity
+training data that enrich the training of recommendation algorithms. We
+conduct quantitative and qualitative experiments on five datasets to validate
+the simulator's effectiveness and stability across various recommendation
+scenarios.
+
+
+
+
+
+
+
+ ☆ Multifaceted User Modeling in Recommendation: A Federated Foundation
+ Models Approach AAAI25
+
+
+
+
+
+
+
+
+ Chunxu Zhang, Guodong Long, Hongkuan Guo, Zhaojie Liu, Guorui Zhou, Zijian Zhang, Yang Liu, Bo Yang
+
+
+ Multifaceted user modeling aims to uncover fine-grained patterns and learn
+representations from user data, revealing their diverse interests and
+characteristics, such as profile, preference, and personality. Recent studies
+on foundation model-based recommendation have emphasized the Transformer
+architecture's remarkable ability to capture complex, non-linear user-item
+interaction relationships. This paper aims to advance foundation model-based
+recommender systems by introducing enhancements to multifaceted user modeling
+capabilities. We propose a novel Transformer layer designed specifically for
+recommendation, using the self-attention mechanism to capture sequential
+user-item interaction patterns. Specifically, we design a group gating network
+to identify user groups, enabling hierarchical discovery across different
+layers, thereby capturing the multifaceted nature of user interests through
+multiple Transformer layers. Furthermore, to broaden the data scope and further
+enhance multifaceted user modeling, we extend the framework to a federated
+setting, enabling the use of private datasets while ensuring privacy.
+Experimental validations on benchmark datasets demonstrate the superior
+performance of our proposed method. Code is available.
+
+
+
+ comment: Accepted as a regular paper of AAAI25
+
+
+
+
+
+
+ ☆ Towards a Unified Paradigm: Integrating Recommendation Systems as a New
+ Language in Large Models
+
+
+
+
+
+
+
+
+ Kai Zheng, Qingfeng Sun, Can Xu, Peng Yu, Qingwei Guo
+
+
+ This paper explores the use of Large Language Models (LLMs) for sequential
+recommendation, which predicts users' future interactions based on their past
+behavior. We introduce a new concept, "Integrating Recommendation Systems as a
+New Language in Large Models" (RSLLM), which combines the strengths of
+traditional recommenders and LLMs. RSLLM uses a unique prompting method that
+combines ID-based item embeddings from conventional recommendation models with
+textual item features. It treats users' sequential behaviors as a distinct
+language and aligns the ID embeddings with the LLM's input space using a
+projector. We also propose a two-stage LLM fine-tuning framework that refines a
+pretrained LLM using a combination of two contrastive losses and a language
+modeling loss. The LLM is first fine-tuned using text-only prompts, followed by
+target domain fine-tuning with unified prompts. This trains the model to
+incorporate behavioral knowledge from the traditional sequential recommender
+into the LLM. Our empirical results validate the effectiveness of our proposed
+framework.
+
+
+
+ comment: 13 pages, 5 figures
+
+
+
+
+
+
+ ☆ Enhancing Supply Chain Transparency in Emerging Economies Using Online
+ Contents and LLMs
+
+
+ In the current global economy, supply chain transparency plays a pivotal role
+in ensuring this security by enabling companies to monitor supplier performance
+and fostering accountability and responsibility. Despite the advancements in
+supply chain relationship datasets like Bloomberg and FactSet, supply chain
+transparency remains a significant challenge in emerging economies due to
+issues such as information asymmetry and institutional gaps in regulation. This
+study proposes a novel approach to enhance supply chain transparency in
+emerging economies by leveraging online content and large language models
+(LLMs). We develop a Supply Chain Knowledge Graph Mining System that integrates
+advanced LLMs with web crawler technology to automatically collect and analyze
+supply chain information. The system's effectiveness is validated through a
+case study focusing on the semiconductor supply chain, a domain that has
+recently gained significant attention due to supply chain risks. Our results
+demonstrate that the proposed system provides greater applicability for
+emerging economies, such as mainland China, complementing the data gaps in
+existing datasets. However, challenges including the accurate estimation of
+monetary and material flows, the handling of time series data, synonyms
+disambiguation, and mitigating biases from online contents still remain.
+Future research should focus on addressing these issues to further enhance the
+system's capabilities and broaden its application to other emerging economies
+and industries.
+
+
+ Universal Multimodal Retrieval (UMR) aims to enable search across various
+modalities using a unified model, where queries and candidates can consist of
+pure text, images, or a combination of both. Previous work has attempted to
+adopt multimodal large language models (MLLMs) to realize UMR using only text
+data. However, our preliminary experiments demonstrate that more diverse
+multimodal training data can further unlock the potential of MLLMs. Despite its
+effectiveness, the existing multimodal training data is highly imbalanced in
+terms of modality, which motivates us to develop a training data synthesis
+pipeline and construct a large-scale, high-quality fused-modal training
+dataset. Based on the synthetic training data, we develop the General
+Multimodal Embedder (GME), an MLLM-based dense retriever designed for UMR.
+Furthermore, we construct a comprehensive UMR Benchmark (UMRB) to evaluate the
+effectiveness of our approach. Experimental results show that our method
+achieves state-of-the-art performance among existing UMR methods. Last, we
+provide in-depth analyses of model scaling, training strategies, and perform
+ablation studies on both the model and synthetic data.
+
+
+
+ comment: 32 pages, models at
+ https://huggingface.co/Alibaba-NLP/gme-Qwen2-VL-2B-Instruct
+
+
+
+
+
+
+ ☆ Joint Knowledge Editing for Information Enrichment and Probability
+ Promotion
+
+
+ Knowledge stored in large language models requires timely updates to reflect
+the dynamic nature of real-world information. To update the knowledge, most
+knowledge editing methods focus on the low layers, since recent probes into the
+knowledge recall process reveal that the answer information is enriched in low
+layers. However, these probes reveal, and can only reveal, critical recall stages
+for the original answers, while the goal of editing is to rectify the model's
+prediction for the target answers. This inconsistency indicates that both the
+probe approaches and the associated editing methods are deficient. To mitigate
+the inconsistency and identify critical editing regions, we propose a
+contrast-based probe approach, and locate two crucial stages where the model
+behavior diverges between the original and target answers: Information
+Enrichment in low layers and Probability Promotion in high layers. Building
+upon the insights, we develop the Joint knowledge Editing for information
+Enrichment and probability Promotion (JEEP) method, which jointly edits both
+the low and high layers to modify the two critical recall stages. Considering
+the mutual interference and growing forgetting due to dual modifications, JEEP
+is designed to ensure that updates to distinct regions share the same
+objectives and are complementary. We rigorously evaluate JEEP by editing up to
+thousands of facts on various models, i.e., GPT-J (6B) and LLaMA (7B), and
+addressing diverse editing objectives, i.e., adding factual and counterfactual
+knowledge. In all tested scenarios, JEEP achieves the best performance, validating
+the effectiveness of the findings of our probe approach and the design of
+our editing method. Our code and data are available at
+https://github.com/Eric8932/JEEP.
+
+
+
+
+
+
+
+ ♻ ☆ Utilizing Large Language Models for Information Extraction from Real
+ Estate Transactions
+
+
+ Real estate sales contracts contain crucial information for property
+transactions, but manual data extraction can be time-consuming and error-prone.
+This paper explores the application of large language models, specifically
+transformer-based architectures, for automated information extraction from real
+estate contracts. We discuss challenges, techniques, and future directions in
+leveraging these models to improve efficiency and accuracy in real estate
+contract analysis. We generated synthetic contracts using the real-world
+transaction dataset, thereby fine-tuning the large-language model and achieving
+significant metrics improvements and qualitative improvements in information
+retrieval and reasoning tasks.
+
+
+
+
+
+
+
+ ♻ ☆ CURE4Rec: A Benchmark for Recommendation Unlearning with Deeper
+ Influence NeurIPS 2024
+
+
+ With increasing privacy concerns in artificial intelligence, regulations have
+mandated the right to be forgotten, granting individuals the right to withdraw
+their data from models. Machine unlearning has emerged as a potential solution
+to enable selective forgetting in models, particularly in recommender systems
+where historical data contains sensitive user information. Despite recent
+advances in recommendation unlearning, evaluating unlearning methods
+comprehensively remains challenging due to the absence of a unified evaluation
+framework and overlooked aspects of deeper influence, e.g., fairness. To
+address these gaps, we propose CURE4Rec, the first comprehensive benchmark for
+recommendation unlearning evaluation. CURE4Rec covers four aspects, i.e.,
+unlearning Completeness, recommendation Utility, unleaRning efficiency, and
+recommendation fairnEss, under three data selection strategies, i.e., core
+data, edge data, and random data. Specifically, we consider the deeper
+influence of unlearning on recommendation fairness and robustness towards data
+with varying impact levels. We construct multiple datasets with CURE4Rec
+evaluation and conduct extensive experiments on existing recommendation
+unlearning methods. Our code is released at
+https://github.com/xiye7lai/CURE4Rec.
+
+
+
+ comment: Accepted to NeurIPS 2024, Datasets and Benchmarks. Website:
+ https://oktton.github.io
+
+ For modern recommender systems, the use of low-dimensional latent
+representations to embed users and items based on their observed interactions
+has become commonplace. However, many existing recommendation models are
+primarily designed for coarse-grained and homogeneous interactions, which
+limits their effectiveness in two critical dimensions. Firstly, these models
+fail to leverage the relational dependencies that exist across different types
+of user behaviors, such as page views, collects, comments, and purchases.
+Secondly, they struggle to capture the fine-grained latent factors that drive
+user interaction patterns. To address these limitations, we present a
+heterogeneous graph collaborative filtering model MixRec that excels at
+disentangling users' multi-behavior interaction patterns and uncovering the
+latent intent factors behind each behavior. Our model achieves this by
+incorporating intent disentanglement and multi-behavior modeling, facilitated
+by a parameterized heterogeneous hypergraph architecture. Furthermore, we
+introduce a novel contrastive learning paradigm that adaptively explores the
+advantages of self-supervised data augmentation, thereby enhancing the model's
+resilience against data sparsity and expressiveness with relation
+heterogeneity. To validate the efficacy of MixRec, we conducted extensive
+experiments on three public datasets. The results clearly demonstrate its
+superior performance, significantly outperforming various state-of-the-art
+baselines. Our model is open-sourced and available at:
+https://github.com/HKUDS/MixRec.
+
+
+
+ comment: This paper is accepted by WSDM'2025
+
+
+
+
+
+
+
+
+
+ Multimedia 7
+
+
+
+
+
+ ☆ Modular Conversational Agents for Surveys and Interviews
+
+
+
+
+
+
+
+
+ Jiangbo Yu, Jinhua Zhao, Luis Miranda-Moreno, Matthew Korp
+
+
+ Surveys and interviews (structured, semi-structured, or unstructured) are
+widely used for collecting insights on emerging or hypothetical scenarios.
+Traditional human-led methods often face challenges related to cost,
+scalability, and consistency. Recently, various domains have begun to explore
+the use of conversational agents (chatbots) powered by large language models
+(LLMs). However, as public investments and policies on infrastructure and
+services often involve substantial public stakes and environmental risks, there
+is a need for a rigorous, transparent, privacy-preserving, and cost-efficient
+development framework tailored for such major decision-making processes. This
+paper addresses this gap by introducing a modular approach and its resultant
+parameterized process for designing conversational agents. We detail the system
+architecture, integrating engineered prompts, specialized knowledge bases, and
+customizable, goal-oriented conversational logic in the proposed approach. We
+demonstrate the adaptability, generalizability, and efficacy of our modular
+approach through three empirical studies: (1) travel preference surveys,
+highlighting multimodal (voice, text, and image generation) capabilities; (2)
+public opinion elicitation on a newly constructed, novel infrastructure
+project, showcasing question customization and multilingual (English and
+French) capabilities; and (3) transportation expert consultation about future
+transportation systems, highlighting real-time, clarification request
+capabilities for open-ended questions, resilience in handling erratic inputs,
+and efficient transcript post-processing. The results show the effectiveness of
+this modular approach and how it addresses key ethical, privacy, security, and
+token consumption concerns, setting the stage for the next-generation surveys
+and interviews.
+
+
+
+
+
+
+
+ ☆ InterDance: Reactive 3D Dance Generation with Realistic Duet Interactions
+
+
+
+
+
+
+
+
+ Ronghui Li, Youliang Zhang, Yachao Zhang, Yuxiang Zhang, Mingyang Su, Jie Guo, Ziwei Liu, Yebin Liu, Xiu Li
+
+
+ Humans perform a variety of interactive motions, among which duet dance is
+one of the most challenging interactions. However, in terms of human motion
+generative models, existing works are still unable to generate high-quality
+interactive motions, especially in the field of duet dance. On the one hand, it
+is due to the lack of large-scale high-quality datasets. On the other hand, it
+arises from the incomplete representation of interactive motion and the lack of
+fine-grained optimization of interactions. To address these challenges, we
+propose InterDance, a large-scale duet dance dataset that significantly
+enhances motion quality, data scale, and the variety of dance genres. Built
+upon this dataset, we propose a new motion representation that can accurately
+and comprehensively describe interactive motion. We further introduce a
+diffusion-based framework with an interaction refinement guidance strategy to
+optimize the realism of interactions progressively. Extensive experiments
+demonstrate the effectiveness of our dataset and algorithm.
+
+
+
+ comment: https://inter-dance.github.io/
+
+
+
+
+
+
+ ☆ Linguistics-Vision Monotonic Consistent Network for Sign Language
+ Production ICASSP 2025
+
+
+
+
+
+
+
+
+ Xu Wang, Shengeng Tang, Peipei Song, Shuo Wang, Dan Guo, Richang Hong
+
+
+ Sign Language Production (SLP) aims to generate sign videos corresponding to
+spoken language sentences, where the conversion of sign Glosses to Poses (G2P)
+is the key step. Due to the cross-modal semantic gap and the lack of
+word-action correspondence labels for strong supervision alignment, the SLP
+suffers huge challenges in linguistics-vision consistency. In this work, we
+propose a Transformer-based Linguistics-Vision Monotonic Consistent Network
+(LVMCN) for SLP, which constrains fine-grained cross-modal monotonic alignment
+and coarse-grained multimodal semantic consistency in language-visual cues
+through Cross-modal Semantic Aligner (CSA) and Multimodal Semantic Comparator
+(MSC). In the CSA, we constrain the implicit alignment between corresponding
+gloss and pose sequences by computing the cosine similarity association matrix
+between cross-modal feature sequences (i.e., the order consistency of
+fine-grained sign glosses and actions). As for MSC, we construct multimodal
+triplets based on paired and unpaired samples in batch data. By pulling closer
+the corresponding text-visual pairs and pushing apart the non-corresponding
+text-visual pairs, we constrain the semantic co-occurrence degree between
+corresponding gloss and pose sequences (i.e., the semantic consistency of
+coarse-grained textual sentences and sign videos). Extensive experiments on the
+popular PHOENIX14T benchmark show that the LVMCN outperforms the
+state-of-the-art.
+
+
+
+ comment: Accepted by ICASSP 2025
+
+
+
+
+
+
+ ☆ AV-DTEC: Self-Supervised Audio-Visual Fusion for Drone Trajectory
+ Estimation and Classification ICRA 2025
+
+
+ The increasing use of compact UAVs has created significant threats to public
+safety, while traditional drone detection systems are often bulky and costly.
+To address these challenges, we propose AV-DTEC, a lightweight self-supervised
+audio-visual fusion-based anti-UAV system. AV-DTEC is trained using
+self-supervised learning with labels generated by LiDAR, and it simultaneously
+learns audio and visual features through a parallel selective state-space
+model. With the learned features, a specially designed plug-and-play
+primary-auxiliary feature enhancement module integrates visual features into
+audio features for better robustness in cross-lighting conditions. To reduce
+reliance on auxiliary features and align modalities, we propose a
+teacher-student model that adaptively adjusts the weighting of visual features.
+AV-DTEC demonstrates exceptional accuracy and effectiveness in real-world
+multi-modality data. The code and trained models are publicly accessible on
+GitHub at https://github.com/AmazingDay1/AV-DETC.
+
+
+
+ comment: Submitted to ICRA 2025
+
+
+
+
+
+
+ ☆ SoundLoc3D: Invisible 3D Sound Source Localization and Classification
+ Using a Multimodal RGB-D Acoustic Camera WACV2025
+
+
+ Accurately localizing 3D sound sources and estimating their semantic labels
+-- where the sources may not be visible, but are assumed to lie on the physical
+surface of objects in the scene -- have many real applications, including
+detecting gas leaks and machinery malfunctions. The audio-visual weak-correlation
+in such a setting poses new challenges in deriving innovative methods to answer
+if or how we can use cross-modal information to solve the task. Towards this
+end, we propose to use an acoustic-camera rig consisting of a pinhole RGB-D
+camera and a coplanar four-channel microphone array (Mic-Array). By using this
+rig to record audio-visual signals from multiviews, we can use the cross-modal
+cues to estimate the sound sources' 3D locations. Specifically, our framework
+SoundLoc3D treats the task as a set prediction problem, where each element in the set
+corresponds to a potential sound source. Given the audio-visual
+weak-correlation, the set representation is initially learned from a single
+view microphone array signal, and then refined by actively incorporating
+physical surface cues revealed from multiview RGB-D images. We demonstrate the
+efficiency and superiority of SoundLoc3D on a large-scale simulated dataset, and
+further show its robustness to RGB-D measurement inaccuracy and ambient noise
+interference.
+
+
+
+
+
+
+
+
+ Vijul Shah, Brian B. Moser, Ko Watanabe, Andreas Dengel
+
+
+ Capturing pupil diameter is essential for assessing psychological and
+physiological states such as stress levels and cognitive load. However, the low
+resolution of images in eye datasets often hampers precise measurement. This
+study evaluates the impact of various upscaling methods, ranging from bicubic
+interpolation to advanced super-resolution, on pupil diameter predictions. We
+compare several pre-trained methods, including CodeFormer, GFPGAN, Real-ESRGAN,
+HAT, and SRResNet. Our findings suggest that pupil diameter prediction models
+trained on upscaled datasets are highly sensitive to the selected upscaling
+method and scale. Our results demonstrate that upscaling methods consistently
+enhance the accuracy of pupil diameter prediction models, highlighting the
+importance of upscaling in pupillometry. Overall, our work provides valuable
+insights for selecting upscaling techniques, paving the way for more accurate
+assessments in psychological and physiological research.
+
+
+
+
+
+
+
+ ♻ ☆ VIoTGPT: Learning to Schedule Vision Tools in LLMs towards Intelligent
+ Video Internet of Things AAAI 2025
+
+
+
+
+
+
+
+
+ Yaoyao Zhong, Mengshi Qi, Rui Wang, Yuhan Qiu, Yang Zhang, Huadong Ma
+
+
+ Video Internet of Things (VIoT) has shown full potential in collecting an
+unprecedented volume of video data. How to schedule the domain-specific
+perceiving models and analyze the collected videos uniformly, efficiently, and
+especially intelligently to accomplish complicated tasks is challenging. To
+address the challenge, we build VIoTGPT, a framework based on LLMs to
+correctly interact with humans, query knowledge videos, and invoke vision
+models to analyze multimedia data collaboratively. To support VIoTGPT and
+related future works, we meticulously crafted the VIoT-Tool dataset, including
+the training dataset and the benchmark involving 11 representative vision
+models across three categories based on semi-automatic annotations. To guide
+LLM to act as the intelligent agent towards intelligent VIoT, we resort to the
+ReAct instruction tuning method based on VIoT-Tool to learn the tool
+capability. Quantitative and qualitative experiments and analyses demonstrate
+the effectiveness of VIoTGPT. We believe VIoTGPT contributes to improving
+human-centered experiences in VIoT applications. The project website is
+https://github.com/zhongyy/VIoTGPT.
+
+
+
+ comment: AAAI 2025, 12 pages
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Information Retrieval 14
+
+
+
+
+
+ ☆ Towards More Robust Retrieval-Augmented Generation: Evaluating RAG Under
+ Adversarial Poisoning Attacks
+
+
+
+
+
+
+
+
+ Jinyan Su, Jin Peng Zhou, Zhengxin Zhang, Preslav Nakov, Claire Cardie
+
+
+ Retrieval-Augmented Generation (RAG) systems have emerged as a promising
+solution to mitigate LLM hallucinations and enhance their performance in
+knowledge-intensive domains. However, these systems are vulnerable to
+adversarial poisoning attacks, where malicious passages injected into retrieval
+databases can mislead the model into generating factually incorrect outputs. In
+this paper, we investigate both the retrieval and the generation components of
+RAG systems to understand how to enhance their robustness against such attacks.
+From the retrieval perspective, we analyze why and how the adversarial contexts
+are retrieved and assess how the quality of the retrieved passages impacts
+downstream generation. From a generation perspective, we evaluate whether LLMs'
+advanced critical thinking and internal knowledge capabilities can be leveraged
+to mitigate the impact of adversarial contexts, i.e., using skeptical prompting
+as a self-defense mechanism. Our experiments and findings provide actionable
+insights into designing safer and more resilient retrieval-augmented
+frameworks, paving the way for their reliable deployment in real-world
+applications.
+
+
+ Recent advancements in generative AI have spurred the development of
+highly adept Large Language Models (LLMs) that integrate diverse data types to
+empower decision-making. Among these, Multimodal Retrieval-Augmented Generation
+(RAG) applications are promising for their capability to combine the strengths
+of information retrieval and generative models, enhancing their utility across
+various domains, including biomedical research. This paper introduces
+AlzheimerRAG, a Multimodal RAG pipeline tool for biomedical research use cases,
+primarily focusing on Alzheimer's disease from PubMed articles. Our pipeline
+incorporates multimodal fusion techniques to integrate textual and visual data
+processing by efficiently indexing and accessing vast amounts of biomedical
+literature. Preliminary experimental results against benchmarks, such as BioASQ
+and PubMedQA, have returned improved results in information retrieval and
+synthesis of domain-specific information. We also demonstrate a case study with
+our RAG pipeline across different Alzheimer's clinical scenarios. We infer that
+AlzheimerRAG can generate responses with accuracy non-inferior to humans and
+with low rates of hallucination. Overall, a reduction in cognitive task load is
+observed, which allows researchers to gain multimodal insights, improving
+understanding and treatment of Alzheimer's disease.
+
+
+ This paper proposes a novel approach to develop an open-domain and long-form
+Over-The-Top (OTT) Question-Answering (QA) dataset, DragonVerseQA, specifically
+oriented to the fantasy universe of "House of the Dragon" and "Game Of Thrones"
+TV series. Most existing QA datasets focus on short, fact-based answers sourced
+almost solely from Wikipedia articles, devoid of depth and contextual richness
+for sophisticated narrative understanding. We curate a dataset that combines
+full episode summaries sourced from HBO and fandom wiki websites, user reviews
+from sources like IMDb and Rotten Tomatoes, and high-quality, open-domain,
+legally admissible sources, and structured data from repositories like WikiData
+into one dataset. The dataset provides a multi-dimensional context, reflecting
+complex character dynamics and plot developments from these varied sources.
+That means, on equal footing, only after heavy data preprocessing and filtering
+methods will meaningful, non-spam unbiased reviews be available in this
+enriched dataset. The comprehensive insights are given through the long-form
+answers generated from this enriched context. This makes the dataset valuable
+for improving conversational AI, narrative analysis, sentiment
+analysis, summarization techniques, and relation extraction.
+ A comparative analysis with state-of-the-art QA datasets such as SQuAD 2.0,
+TriviaQA, and Natural Questions brings to light the unique advantages of our
+dataset in terms of contextual complexity and answer length. Detailed reviews
+add layers to audience sentiment and narrative interpretation, raising the bar
+for domain-specific QA with a new quality benchmark. Our work also allows a
+deeper understanding of entertainment-industry content and opens the door to
+more knowledgeable and creative AI-driven interactions within digital media
+environments.
+
+
+
+
+
+
+
+ ☆ Large Language Model Can Be a Foundation for Hidden Rationale-Based
+ Retrieval ECIR 2025
+
+
+ Despite the recent advancement in Retrieval-Augmented Generation (RAG)
+systems, most retrieval methodologies are often developed for factual
+retrieval, which assumes query and positive documents are semantically similar.
+In this paper, we instead propose and study a more challenging type of
+retrieval task, called hidden rationale retrieval, in which query and document
+are not similar but can be inferred by reasoning chains, logic relationships,
+or empirical experiences. To address such problems, an instruction-tuned Large
+language model (LLM) with a cross-encoder architecture could be a reasonable
+choice. To further strengthen pioneering LLM-based retrievers, we design a
+special instruction that transforms the retrieval task into a generative task
+by prompting LLM to answer a binary-choice question. The model can be
+fine-tuned with direct preference optimization (DPO). The framework is also
+optimized for computational efficiency with no performance degradation. We name
+this retrieval framework RaHoRe and verify its zero-shot and fine-tuned
+performance superiority on Emotional Support Conversation (ESC), compared with
+previous retrieval works. Our study suggests the potential to employ LLM as a
+foundation for a wider scope of retrieval tasks. Our codes, models, and
+datasets are available on https://github.com/flyfree5/LaHoRe.
+
+
+ Fill-in-the-Middle (FIM) models play a vital role in code completion tasks,
+leveraging both prefix and suffix context to provide more accurate and
+contextually relevant suggestions. This paper presents approaches to improve
+FIM code completion while addressing the challenge of maintaining low latency
+for real-time coding assistance. We enhance FIM code completion by
+incorporating context and curriculum examples in the training process. We
+identify patterns where completion suggestions fail more frequently, revealing
+complexities that smaller language models struggle with. To address these
+challenges, we develop a curriculum dataset by extracting hard-to-complete
+patterns from code repositories and generate context examples using semantic
+and static analysis tools (e.g. TSC compiler). We fine-tune various sized
+models, including StarCoder and DeepSeek, on this enhanced dataset. Our
+evaluation encompasses three key dimensions: the Santa Coder FIM task, the
+Amazon CCEval benchmark, and a new Multi-Line Infilling evaluation benchmark
+derived from SWE-bench. Comprehensive ablation studies across multiple model
+sizes reveal that while all fine-tuned models show improvements, the
+performance gains are more pronounced for smaller parameter models and
+incorporating difficult-to-complete examples, as part of curriculum learning,
+improves the code completion performance. This finding is particularly
+significant given the latency constraints of code completion tasks. While
+larger models like GPT and Claude perform well in multi-line completions, they
+are prohibitively challenging to use given their high latency, whereas our fine-tuned
+models achieve a balance between performance and latency. Finally, we validate
+our approach through online A/B testing, demonstrating tangible improvements in
+Completion Acceptance Rate (CAR) and Completion Persistence Rate (CPR), with
+zero latency impact.
+
+
+ The takeaway recommendation system is designed to recommend users' future
+takeaway purchases based on their historical purchase behaviors, thereby
+improving user satisfaction and increasing merchant sales. Existing methods
+focus on incorporating auxiliary information or leveraging knowledge graphs to
+alleviate the sparsity issue of user purchase sequence data. However, two main
+challenges limit the performance of these approaches: (1) how to capture
+dynamic user preferences on complex geospatial information and (2) how to
+efficiently integrate spatial-temporal knowledge from graphs and sequence data
+with low calculation costs. In this paper, we propose a novel spatial-temporal
+knowledge distillation for takeaway recommendation model (STKDRec) based on the
+two-stage training process. Specifically, during the first pre-training stage,
+a spatial-temporal knowledge graph (STKG) encoder is pre-trained to extract the
+high-order spatial-temporal and collaborative associations within the STKG.
+During the second STKD stage, a spatial-temporal Transformer is employed to
+comprehensively model dynamic user preferences on various types of fine-grained
+geospatial information from a sequence perspective. Furthermore, the STKD
+strategy is introduced to adaptively fuse the rich spatial-temporal knowledge
+from the pre-trained STKG encoder and the spatial-temporal transformer while
+reducing the cost of model training. Extensive experiments on three real-world
+datasets show that our STKDRec significantly outperforms the state-of-the-art
+baselines. Our code is available at: https://github.com/Zhaoshuyuan0246/STKDRec.
+
+
+ Graph Neural Networks (GNNs) have exhibited remarkable efficacy in diverse
+graph learning tasks, particularly on static homophilic graphs. Recent
+attention has pivoted towards more intricate structures, encompassing (1)
+static heterophilic graphs encountering the edge heterophily issue in the
+spatial domain and (2) event-based continuous graphs in the temporal domain.
+State-of-the-art (SOTA) has been concurrently addressing these two lines of
+work but tends to overlook the presence of heterophily in the temporal domain,
+constituting the temporal heterophily issue. Furthermore, we highlight that the
+edge heterophily issue and the temporal heterophily issue often co-exist in
+event-based continuous graphs, giving rise to the temporal edge heterophily
+challenge. To tackle this challenge, this paper first introduces the temporal
+edge heterophily measurement. Subsequently, we propose the Temporal
+Heterophilic Graph Convolutional Network (THeGCN), an innovative model that
+incorporates the low/high-pass graph signal filtering technique to accurately
+capture both edge (spatial) heterophily and temporal heterophily. Specifically,
+the THeGCN model consists of two key components: a sampler and an aggregator.
+The sampler selects events relevant to a node at a given moment. Then, the
+aggregator executes message-passing, encoding temporal information, node
+attributes, and edge attributes into node embeddings. Extensive experiments
+conducted on 5 real-world datasets validate the efficacy of THeGCN.
+
+
+ Spoken term detection (STD) is often hindered by reliance on frame-level
+features and the computationally intensive DTW-based template matching,
+limiting its practicality. To address these challenges, we propose a novel
+approach that encodes speech into discrete, speaker-agnostic semantic tokens.
+This facilitates fast retrieval using text-based search algorithms and
+effectively handles out-of-vocabulary terms. Our approach focuses on generating
+consistent token sequences across varying utterances of the same term. We also
+propose a bidirectional state space modeling within the Mamba encoder, trained
+in a self-supervised learning framework, to learn contextual frame-level
+features that are further encoded into discrete tokens. Our analysis shows that
+our speech tokens exhibit greater speaker invariance than those from existing
+tokenizers, making them more suitable for STD tasks. Empirical evaluation on
+LibriSpeech and TIMIT databases indicates that our method outperforms existing
+STD baselines while being more efficient.
+
+
+
+ comment: Accepted at ICASSP 2025
+
+
+
+
+
+
+ ♻ ☆ DaRec: A Disentangled Alignment Framework for Large Language Model and
+ Recommender System
+
+
+ Benefiting from the strong reasoning capabilities, Large language models
+(LLMs) have demonstrated remarkable performance in recommender systems. Various
+efforts have been made to distill knowledge from LLMs to enhance collaborative
+models, employing techniques like contrastive learning for representation
+alignment. In this work, we prove that directly aligning the representations of
+LLMs and collaborative models is sub-optimal for enhancing downstream
+recommendation tasks performance, based on the information theorem.
+Consequently, the challenge of effectively aligning semantic representations
+between collaborative models and LLMs remains unresolved. Inspired by this
+viewpoint, we propose a novel plug-and-play alignment framework for LLMs and
+collaborative models. Specifically, we first disentangle the latent
+representations of both LLMs and collaborative models into specific and shared
+components via projection layers and representation regularization.
+Subsequently, we perform both global and local structure alignment on the
+shared representations to facilitate knowledge transfer. Additionally, we
+theoretically prove that the specific and shared representations contain more
+pertinent and less irrelevant information, which can enhance the effectiveness
+of downstream recommendation tasks. Extensive experimental results on benchmark
+datasets demonstrate that our method is superior to existing state-of-the-art
+algorithms.
+
+
+
+
+
+
+
+ ♻ ☆ When SparseMoE Meets Noisy Interactions: An Ensemble View on Denoising
+ Recommendation ICASSP 2025
+
+
+ Learning user preferences from implicit feedback is one of the core
+challenges in recommendation. The difficulty lies in the potential noise within
+implicit feedback. Therefore, various denoising recommendation methods have
+been proposed recently. However, most of them rely too heavily on hyperparameter
+configurations, inevitably leading to inadequacies in model adaptability and
+generalization performance. In this study, we propose a novel Adaptive Ensemble
+Learning (AEL) method for denoising recommendation, which employs a sparse gating
+network as a brain, selecting suitable experts to synthesize appropriate
+denoising capacities for different data samples. To address the ensemble
+learning shortcoming of model complexity and ensure sub-recommender diversity,
+we also propose a novel method that stacks components to create
+sub-recommenders instead of directly constructing them. Extensive experiments
+across various datasets demonstrate that AEL outperforms other methods on a
+range of popular metrics, even in the presence of substantial and dynamic noise. Our
+code is available at https://github.com/cpu9xx/AEL.
+
+
+
+ comment: Accepted at ICASSP 2025. 5 pages, 4 figures
+
+ By generating new yet effective data, data augmentation has become a
+promising method to mitigate the data sparsity problem in sequential
+recommendation. Existing works focus on augmenting the original data but rarely
+explore the issue of imbalanced relevance and diversity for augmented data,
+leading to semantic drift problems or limited performance improvements. In this
+paper, we propose a novel Balanced data Augmentation Plugin for Sequential
+Recommendation (BASRec) to generate data that balance relevance and diversity.
+BASRec consists of two modules: Single-sequence Augmentation and Cross-sequence
+Augmentation. The former leverages the randomness of the heuristic operators to
+generate diverse sequences for a single user, after which the diverse and the
+original sequences are fused at the representation level to obtain relevance.
+Further, we devise a reweighting strategy to enable the model to learn the
+preferences based on the two properties adaptively. The Cross-sequence
+Augmentation performs nonlinear mixing between different sequence
+representations from two directions. It produces virtual sequence
+representations that are diverse enough but retain the vital semantics of the
+original sequences. These two modules help the model discover
+fine-grained preference knowledge from single-user and cross-user
+perspectives. Extensive experiments verify the effectiveness of BASRec. The
+average improvement is up to 72.0% on GRU4Rec, 33.8% on SASRec, and 68.5% on
+FMLP-Rec. We demonstrate that BASRec generates data with a better balance
+between relevance and diversity than existing methods. The source code is
+available at https://github.com/KingGugu/BASRec.
+
+
+
+ comment: Accepted by AAAI 2025
+
+
+
+
+
+
+ ♻ ☆ LLMEmb: Large Language Model Can Be a Good Embedding Generator for
+ Sequential Recommendation AAAI'25
+
+
+ Sequential Recommender Systems (SRS), which model a user's interaction
+history to predict the next item of interest, are widely used in various
+applications. However, existing SRS often struggle with low-popularity items, a
+challenge known as the long-tail problem. This issue leads to reduced
+serendipity for users and diminished profits for sellers, ultimately harming
+the overall system. Large Language Models (LLMs) can capture
+semantic relationships between items, independent of their popularity, making
+them a promising solution to this problem. In this paper, we introduce LLMEmb, a
+novel method leveraging LLM to generate item embeddings that enhance SRS
+performance. To bridge the gap between general-purpose LLM and the
+recommendation domain, we propose a Supervised Contrastive Fine-Tuning (SCFT)
+approach. This approach includes attribute-level data augmentation and a
+tailored contrastive loss to make LLM more recommendation-friendly.
+Additionally, we emphasize the importance of integrating collaborative signals
+into LLM-generated embeddings, for which we propose Recommendation Adaptation
+Training (RAT). This further refines the embeddings for optimal use in SRS. The
+LLMEmb-derived embeddings can be seamlessly integrated with any SRS models,
+underscoring the practical value. Comprehensive experiments conducted on three
+real-world datasets demonstrate that LLMEmb significantly outperforms existing
+methods across multiple SRS models. The code for our method is released online at
+https://github.com/Applied-Machine-Learning-Lab/LLMEmb.
+
+
+
+ comment: accepted by AAAI'25
+
+
+
+
+
+
+ ♻ ☆ Unleashing the Power of Large Language Models in Zero-shot Relation
+ Extraction via Self-Prompting EMNLP 2024
+
+
+
+
+
+
+
+
+ Siyi Liu, Yang Li, Jiang Li, Shan Yang, Yunshi Lan
+
+
+ Recent research in zero-shot Relation Extraction (RE) has focused on using
+Large Language Models (LLMs) due to their impressive zero-shot capabilities.
+However, current methods often perform suboptimally, mainly due to a lack of
+detailed, context-specific prompts needed for understanding various sentences
+and relations. To address this, we introduce the Self-Prompting framework, a
+novel method designed to fully harness the embedded RE knowledge within LLMs.
+Specifically, our framework employs a three-stage diversity approach to prompt
+LLMs, generating multiple synthetic samples that encapsulate specific relations
+from scratch. These generated samples act as in-context learning samples,
+offering explicit and context-specific guidance to efficiently prompt LLMs for
+RE. Experimental evaluations on benchmark datasets show our approach
+outperforms existing LLM-based zero-shot RE methods. Additionally, our
+experiments confirm the effectiveness of our generation pipeline in producing
+high-quality synthetic data that enhances performance.
+
+
+
+ comment: EMNLP 2024 Short
+
+
+
+
+
+
+ ♻ ☆ Predicting Quality of Video Gaming Experience Using Global-Scale
+ Telemetry Data and Federated Learning
+
+
+
+
+
+
+
+
+ Zhongyang Zhang, Jinhe Wen, Zixi Chen, Dara Arbab, Sruti Sahani, William Lewis, Kent Giard, Bijan Arbab, Haojian Jin, Tauhidur Rahman
+
+
+ Frames Per Second (FPS) significantly affects the gaming experience.
+Providing players with accurate FPS estimates prior to purchase benefits both
+players and game developers. However, we have a limited understanding of how to
+predict a game's FPS performance on a specific device. In this paper, we first
+conduct a comprehensive analysis of a wide range of factors that may affect
+game FPS on a global-scale dataset to identify the determinants of FPS. This
+includes player-side and game-side characteristics, as well as country-level
+socio-economic statistics. Furthermore, recognizing that accurate FPS
+predictions require extensive user data, which raises privacy concerns, we
+propose a federated learning-based model to ensure user privacy. Each player
+and game is assigned a unique learnable knowledge kernel that gradually
+extracts latent features for improved accuracy. We also introduce a novel
+training and prediction scheme that allows these kernels to be dynamically
+plug-and-play, effectively addressing cold start issues. To train this model
+with minimal bias, we collected a large telemetry dataset from 224 countries
+and regions, 100,000 users, and 835 games. Our model achieved a mean
+Wasserstein distance of 0.469 between predicted and ground truth FPS
+distributions, outperforming all baseline methods.
+
+
+
+ comment: 22 pages, 11 figures, 6 tables
+
+
+
+
+
+
+
+
+
+ Multimedia 3
+
+
+
+
+
+ ☆ Improving Lip-synchrony in Direct Audio-Visual Speech-to-Speech
+ Translation ICASSP
+
+
+ Audio-Visual Speech-to-Speech Translation typically prioritizes improving
+translation quality and naturalness. However, an equally critical aspect in
+audio-visual content is lip-synchrony (ensuring that the movements of the lips
+match the spoken content), which is essential for maintaining realism in dubbed videos.
+Despite its importance, the inclusion of lip-synchrony constraints in AVS2S
+models has been largely overlooked. This study addresses this gap by
+integrating a lip-synchrony loss into the training process of AVS2S models. Our
+proposed method significantly enhances lip-synchrony in direct audio-visual
+speech-to-speech translation, achieving an average LSE-D score of 10.67,
+representing a 9.2% reduction in LSE-D over a strong baseline across four
+language pairs. Additionally, it maintains the naturalness and high quality of
+the translated speech when overlaid onto the original video, without any
+degradation in translation quality.
+
+
+ Text-editable and pose-controllable character video generation is a
+challenging but prevailing topic with practical applications. However, existing
+approaches mainly focus on single-object video generation with pose guidance,
+ignoring the realistic situation in which multiple characters appear concurrently in a
+scene. To tackle this, we propose a novel tuning-free multi-character video generation
+framework based on separate text and
+pose guidance. Specifically, we first extract character masks from the pose
+sequence to identify the spatial position of each generated character, and
+then obtain a separate prompt for each character with LLMs for precise text
+guidance. Moreover, spatially aligned cross-attention and a multi-branch
+control module are proposed to generate fine-grained, controllable
+multi-character video. The visualized results of the generated videos demonstrate
+the precise controllability of our method for multi-character generation. We
+also verify the generality of our method by applying it to various personalized
+T2I models. Moreover, the quantitative results show that our approach achieves
+superior performance compared with previous works.
+
+
+
+ comment: 5 pages, conference
+
+
+
+
+
+
+ ♻ ☆ Hand1000: Generating Realistic Hands from Text with Only 1,000 Images AAAI 2025
+
+
+ Text-to-image generation models have achieved remarkable advancements in
+recent years, aiming to produce realistic images from textual descriptions.
+However, these models often struggle with generating anatomically accurate
+representations of human hands. The resulting images frequently exhibit issues
+such as incorrect numbers of fingers, unnatural twisting or interlacing of
+fingers, or blurred and indistinct hands. These issues stem from the inherent
+complexity of hand structures and the difficulty in aligning textual
+descriptions with precise visual depictions of hands. To address these
+challenges, we propose a novel approach named Hand1000 that enables the
+generation of realistic hand images with a target gesture using only 1,000
+training samples. The training of Hand1000 is divided into three stages with
+the first stage aiming to enhance the model's understanding of hand anatomy by
+using a pre-trained hand gesture recognition model to extract gesture
+representation. The second stage further optimizes text embedding by
+incorporating the extracted hand gesture representation, to improve alignment
+between the textual descriptions and the generated hand images. The third stage
+utilizes the optimized embedding to fine-tune the Stable Diffusion model to
+generate realistic hand images. In addition, we construct the first publicly
+available dataset specifically designed for text-to-hand image generation.
+Based on an existing hand gesture recognition dataset, we adopt advanced image
+captioning models and LLaMA3 to generate high-quality textual descriptions
+enriched with detailed gesture information. Extensive experiments demonstrate
+that Hand1000 significantly outperforms existing models in producing
+anatomically correct hand images while faithfully representing other details in
+the text, such as faces, clothing, and colors.
+
+
+
+
+
+
+
+
+ Amin Bigdeli, Negar Arabzadeh, Ebrahim Bagheri, Charles L. A. Clarke
+
+
+ Recent research has shown that neural information retrieval techniques may be
+susceptible to adversarial attacks. Adversarial attacks seek to manipulate the
+ranking of documents, with the intention of exposing users to targeted content.
+In this paper, we introduce the Embedding Perturbation Rank Attack (EMPRA)
+method, a novel approach designed to perform adversarial attacks on black-box
+Neural Ranking Models (NRMs). EMPRA manipulates sentence-level embeddings,
+guiding them towards pertinent context related to the query while preserving
+semantic integrity. This process generates adversarial texts that seamlessly
+integrate with the original content and remain imperceptible to humans. Our
+extensive evaluation on the widely used MS MARCO V1 passage
+collection demonstrates the effectiveness of EMPRA against a wide range of
+state-of-the-art baselines in promoting a specific set of target documents
+within a given ranked result list. Specifically, EMPRA successfully
+re-ranks almost 96% of target documents originally ranked between 51 and 100
+into the top 10. Furthermore, EMPRA does not depend on surrogate
+models for adversarial text generation, enhancing its robustness against
+different NRMs in realistic settings.
+
+
+
+
+
+
+
+ ☆ HybGRAG: Hybrid Retrieval-Augmented Generation on Textual and Relational
+ Knowledge Bases
+
+
+ Given a semi-structured knowledge base (SKB), where text documents are
+interconnected by relations, how can we effectively retrieve relevant
+information to answer user questions? Retrieval-Augmented Generation (RAG)
+retrieves documents to assist large language models (LLMs) in question
+answering, while Graph RAG (GRAG) uses structured knowledge bases as its
+knowledge source. However, many questions require both textual and relational
+information from an SKB (referred to as "hybrid" questions), which complicates
+the retrieval process and underscores the need for a hybrid retrieval method
+that leverages both kinds of information. In this paper, through our empirical analysis,
+we identify key insights that show why existing methods may struggle with
+hybrid question answering (HQA) over SKB. Based on these insights, we propose
+HybGRAG for HQA consisting of a retriever bank and a critic module, with the
+following advantages: (1) Agentic, it automatically refines the output by
+incorporating feedback from the critic module, (2) Adaptive, it solves hybrid
+questions requiring both textual and relational information with the retriever
+bank, (3) Interpretable, it justifies decision making with an intuitive refinement
+path, and (4) Effective, it surpasses all baselines on HQA benchmarks. In
+experiments on the STaRK benchmark, HybGRAG achieves significant performance
+gains, with an average relative improvement in Hit@1 of 51%.
+
+
+
+
+
+
+
+ ☆ Towards Interpretable Radiology Report Generation via Concept
+ Bottlenecks using a Multi-Agentic RAG ECIR 2025
+
+
+
+
+
+
+
+
+ Hasan Md Tusfiqur Alam, Devansh Srivastav, Md Abdul Kadir, Daniel Sonntag
+
+
+ Deep learning has advanced medical image classification, but interpretability
+challenges hinder its clinical adoption. This study enhances interpretability
+in Chest X-ray (CXR) classification by using concept bottleneck models (CBMs)
+and a multi-agent Retrieval-Augmented Generation (RAG) system for report
+generation. By modeling relationships between visual features and clinical
+concepts, we create interpretable concept vectors that guide a multi-agent RAG
+system to generate radiology reports, enhancing clinical relevance,
+explainability, and transparency. Evaluation of the generated reports using an
+LLM-as-a-judge confirmed the interpretability and clinical utility of our
+model's outputs. On the COVID-QU dataset, our model achieved 81% classification
+accuracy and demonstrated robust report generation performance, with five key
+metrics ranging between 84% and 90%. This interpretable multi-agent framework
+bridges the gap between high-performance AI and the explainability required for
+reliable AI-driven CXR analysis in clinical settings.
+
+
+
+ comment: Accepted in ECIR 2025
+
+
+
+
+
+
+ ☆ Legommenders: A Comprehensive Content-Based Recommendation Library with
+ LLM Support
+
+
+ We present Legommenders, a unique library designed for content-based
+recommendation that enables the joint training of content encoders alongside
+behavior and interaction modules, thereby facilitating the seamless integration
+of content understanding directly into the recommendation pipeline.
+Legommenders allows researchers to effortlessly create and analyze over 1,000
+distinct models across 15 diverse datasets. Further, it supports the
+incorporation of contemporary large language models, as both feature encoders
+and data generators, offering a robust platform for developing state-of-the-art
+recommendation models and enabling more personalized and effective content
+delivery.
+
+
+
+
+
+
+
+ ☆ From General to Specific: Tailoring Large Language Models for
+ Personalized Healthcare
+
+
+
+
+
+
+
+
+ Ruize Shi, Hong Huang, Wei Zhou, Kehan Yin, Kai Zhao, Yun Zhao
+
+
+ The rapid development of large language models (LLMs) has transformed many
+industries, including healthcare. However, previous medical LLMs have largely
+focused on leveraging general medical knowledge to provide responses, without
+accounting for patient variability and lacking true personalization at the
+individual level. To address this, we propose a novel method called
+personalized medical language model (PMLM), which explores and optimizes
+personalized LLMs through recommendation systems and reinforcement learning
+(RL). Specifically, by utilizing self-informed and peer-informed
+personalization, PMLM captures changes in behaviors and preferences to design
+initial personalized prompts tailored to individual needs. We further refine
+these initial personalized prompts through RL, ultimately enhancing the
+precision of LLM guidance. Notably, the personalized prompts are hard prompts,
+which grants PMLM high adaptability and reusability, allowing it to directly
+leverage high-quality proprietary LLMs. We evaluate PMLM using real-world
+obstetrics and gynecology data, and the experimental results demonstrate that
+PMLM achieves personalized responses, and it provides more refined and
+individualized services, offering a potential way for personalized medical
+LLMs.
+
+
+
+
+
+
+
+ ☆ Learned Compression of Nonlinear Time Series With Random Access ICDE 2025
+
+
+
+
+
+
+
+
+ Andrea Guerra, Giorgio Vinciguerra, Antonio Boffa, Paolo Ferragina
+
+
+ Time series play a crucial role in many fields, including finance,
+healthcare, industry, and environmental monitoring. The storage and retrieval
+of time series can be challenging due to their unstoppable growth. In fact,
+these applications often sacrifice precious historical data to make room for
+new data.
+ General-purpose compressors can mitigate this problem with their good
+compression ratios, but they lack efficient random access on compressed data,
+thus preventing real-time analyses. Ad-hoc streaming solutions, instead,
+typically optimise only for compression and decompression speed, while giving
+up compression effectiveness and random access functionality. Furthermore, all
+these methods lack awareness of certain special regularities of time series,
+whose trends over time can often be described by some linear and nonlinear
+functions.
+ To address these issues, we introduce NeaTS, a randomly-accessible
+compression scheme that approximates the time series with a sequence of
+nonlinear functions of different kinds and shapes, carefully selected and
+placed by a partitioning algorithm to minimise the space. The approximation
+residuals are bounded, which allows storing them in little space and thus
+recovering the original data losslessly, or simply discarding them to obtain a
+lossy time series representation with maximum error guarantees.
+ Our experiments show that NeaTS improves the compression ratio of the
+state-of-the-art lossy compressors that use linear or nonlinear functions (or
+both) by up to 14%. Compared to lossless compressors, NeaTS emerges as the only
+approach to date providing, simultaneously, compression ratios close to or
+better than the best existing compressors, a much faster decompression speed,
+and orders of magnitude more efficient random access, thus enabling the storage
+and real-time analysis of massive and ever-growing amounts of (historical) time
+series data.
+
+
+
+ comment: Accepted for publication in Proceedings of the 41st IEEE
+ International Conference on Data Engineering (ICDE 2025)
+
+
+
+
+
+
+ ☆ ASPIRE: Assistive System for Performance Evaluation in IR ECIR
+
+
+ Information Retrieval (IR) evaluation involves far more complexity than
+merely presenting performance measures in a table. Researchers often need to
+compare multiple models across various dimensions, such as the Precision-Recall
+trade-off and response time, to understand the reasons behind the varying
+performance of specific queries for different models. We introduce ASPIRE
+(Assistive System for Performance Evaluation in IR), a visual analytics tool
+designed to address these complexities by providing an extensive and
+user-friendly interface for in-depth analysis of IR experiments. ASPIRE
+supports four key aspects of IR experiment evaluation and analysis:
+single/multi-experiment comparisons, query-level analysis, query
+characteristics-performance interplay, and collection-based retrieval analysis.
+We showcase the functionality of ASPIRE using the TREC Clinical Trials
+collection. ASPIRE is an open-source toolkit available online:
+https://github.com/GiorgosPeikos/ASPIRE
+
+
+
+ comment: Accepted as a demo paper at the 47th European Conference on
+ Information Retrieval (ECIR)
+
+
+
+
+
+
+ ☆ Music Genre Classification: Ensemble Learning with Subcomponents-level
+ Attention
+
+
+ Music Genre Classification is one of the most popular topics in the fields of
+Music Information Retrieval (MIR) and digital signal processing. Deep Learning
+has emerged as the top performer for classifying music genres among various
+methods. This letter introduces a novel approach that combines ensemble learning
+with attention to subcomponents, aiming to enhance the accuracy of identifying
+music genres. The core innovation of our work is the proposal to classify the
+subcomponents of the music pieces separately, allowing our model to capture
+distinct characteristics from those subcomponents. By applying ensemble
+learning techniques to these individual classifications, we make the final
+classification decision on the genre of the music. The proposed method
+achieves higher accuracy than other state-of-the-art
+techniques trained and tested on the GTZAN dataset.
+
+
+
+
+
+
+
+ ☆ ADEQA: A Question Answer based approach for joint ADE-Suspect Extraction
+ using Sequence-To-Sequence Transformers
+
+
+
+
+
+
+
+
+ Vinayak Arannil, Tomal Deb, Atanu Roy
+
+
+ Early identification of Adverse Drug Events (ADE) is critical for taking
+prompt actions while introducing new drugs into the market. ADE
+information is available through various unstructured data sources like
+clinical study reports, patient health records, social media posts, etc.
+Extracting ADEs and the related suspect drugs using machine learning is a
+challenging task due to the complex linguistic relations between drug-ADE pairs
+in textual data and the unavailability of large labelled corpora. This
+paper introduces ADEQA, a question-answer (QA) based approach using quasi-supervised
+labelled data and sequence-to-sequence transformers to extract ADEs,
+drug suspects and the relationships between them. Unlike traditional QA models,
+natural language generation (NLG) based models do not require extensive token-level
+labelling and thereby reduce the adoption barrier significantly. On a
+public ADE corpus, we were able to achieve state-of-the-art results with an F1
+score of 94% on establishing the relationships between ADEs and the respective
+suspects.
+
+
+
+
+
+
+
+ ☆ PolySmart and VIREO @ TRECVid 2024 Ad-hoc Video Search
+
+
+ This year, we explore generation-augmented retrieval for the TRECVid AVS
+task. Specifically, the understanding of the textual query is enhanced by three
+generations, including Text2Text, Text2Image, and Image2Text, to address the
+out-of-vocabulary problem. Using different combinations of them and the rank
+list retrieved by the original query, we submitted four automatic runs. For
+manual runs, we use a large language model (LLM) (i.e., GPT4) to rephrase test
+queries based on the concept bank of the search engine, and we manually check
+again to ensure all the concepts used in the rephrased queries are in the bank.
+The results show that the fusion of the original and generated queries
+outperforms the original query on TV24 query sets. The generated queries
+retrieve different rank lists from the original query.
+
+
+
+
+
+
+
+ ♻ ☆ A Comparative Study of Text Retrieval Models on DaReCzech
+
+
+
+
+
+
+
+
+ Jakub Stetina, Martin Fajcik, Michal Stefanik, Michal Hradis
+
+
+ This article presents a comprehensive evaluation of 7 off-the-shelf document
+retrieval models: Splade, Plaid, Plaid-X, SimCSE, Contriever, OpenAI ADA and
+Gemma2, chosen to determine their performance on the Czech retrieval dataset
+DaReCzech. The primary objective of our experiments is to estimate the quality
+of modern retrieval approaches in the Czech language. Our analyses include
+retrieval quality, speed, and memory footprint. Secondly, we analyze whether it
+is better to apply the model directly to Czech text, or to use machine
+translation into English, followed by retrieval in English. Our experiments
+identify the most effective option for Czech information retrieval. The
+findings reveal notable performance differences among the models, with
+Gemma2 achieving the highest precision and recall, while Contriever performs
+poorly. Overall, the SPLADE and PLAID models offer a balance of efficiency
+and performance.
+
+
+
+
+
+
+
+ ♻ ☆ Advanced Reasoning and Transformation Engine for Multi-Step Insight
+ Synthesis in Data Analytics with Large Language Models
+
+
+ This paper presents the Advanced Reasoning and Transformation Engine for
+Multi-Step Insight Synthesis in Data Analytics (ARTEMIS-DA), a novel framework
+designed to augment Large Language Models (LLMs) for solving complex,
+multi-step data analytics tasks. ARTEMIS-DA integrates three core components:
+the Planner, which dissects complex user queries into structured, sequential
+instructions encompassing data preprocessing, transformation, predictive
+modeling, and visualization; the Coder, which dynamically generates and
+executes Python code to implement these instructions; and the Grapher, which
+interprets generated visualizations to derive actionable insights. By
+orchestrating the collaboration between these components, ARTEMIS-DA
+effectively manages sophisticated analytical workflows involving advanced
+reasoning, multi-step transformations, and synthesis across diverse data
+modalities. The framework achieves state-of-the-art (SOTA) performance on
+benchmarks such as WikiTableQuestions and TabFact, demonstrating its ability to
+tackle intricate analytical tasks with precision and adaptability. By combining
+the reasoning capabilities of LLMs with automated code generation and execution
+and visual analysis, ARTEMIS-DA offers a robust, scalable solution for
+multi-step insight synthesis, addressing a wide range of challenges in data
+analytics.
+
+
+
+
+
+
+
+ ♻ ☆ SAFERec: Self-Attention and Frequency Enriched Model for Next Basket
+ Recommendation
+
+
+ Transformer-based approaches such as BERT4Rec and SASRec demonstrate strong
+performance in Next Item Recommendation (NIR) tasks. However, applying these
+architectures to Next-Basket Recommendation (NBR) tasks, which often involve
+highly repetitive interactions, is challenging due to the vast number of
+possible item combinations in a basket. Moreover, frequency-based methods such
+as TIFU-KNN and UP-CF still demonstrate strong performance in NBR tasks,
+frequently outperforming deep-learning approaches. This paper introduces
+SAFERec, a novel algorithm for NBR that enhances transformer-based
+architectures from NIR by incorporating item frequency information,
+consequently improving their applicability to NBR tasks. Extensive experiments
+on multiple datasets show that SAFERec outperforms all other baselines,
+specifically achieving an 8% improvement in Recall@10.
+
+
+
+
+
+
+
+
+ Jianlyu Chen, Nan Wang, Chaofan Li, Bo Wang, Shitao Xiao, Han Xiao, Hao Liao, Defu Lian, Zheng Liu
+
+
+ Evaluation plays a crucial role in the advancement of information retrieval
+(IR) models. However, current benchmarks, which are based on predefined domains
+and human-labeled data, face limitations in addressing evaluation needs for
+emerging domains both cost-effectively and efficiently. To address this
+challenge, we propose the Automated Heterogeneous Information Retrieval
+Benchmark (AIR-Bench). AIR-Bench is distinguished by three key features: 1)
+Automated. The testing data in AIR-Bench is automatically generated by large
+language models (LLMs) without human intervention. 2) Heterogeneous. The
+testing data in AIR-Bench is generated with respect to diverse tasks, domains
+and languages. 3) Dynamic. The domains and languages covered by AIR-Bench are
+constantly augmented to provide an increasingly comprehensive evaluation
+benchmark for community developers. We develop a reliable and robust data
+generation pipeline to automatically create diverse and high-quality evaluation
+datasets based on real-world corpora. Our findings demonstrate that the
+generated testing data in AIR-Bench aligns well with human-labeled testing
+data, making AIR-Bench a dependable benchmark for evaluating IR models. The
+resources in AIR-Bench are publicly available at
+https://github.com/AIR-Bench/AIR-Bench.
+
+
+ Negative feedback signals are crucial to guardrail content recommendations
+and improve user experience. When these signals are effectively integrated into
+recommendation systems, they play a vital role in preventing the promotion of
+harmful or undesirable content, thereby contributing to a healthier online
+environment. However, the challenges associated with negative signals are
+noteworthy. Due to the limited visibility of options for users to express
+negative feedback, these signals are often sparse compared to positive signals.
+This imbalance can lead to a skewed understanding of user preferences,
+resulting in recommendations that prioritize short-term engagement over
+long-term satisfaction. Moreover, an over-reliance on positive signals can
+create a filter bubble, where users are continuously exposed to content that
+aligns with their immediate preferences but may not be beneficial in the long
+run. This scenario can ultimately lead to user attrition as audiences become
+disillusioned with the quality of the content provided. Additionally, existing
+user signals frequently fail to meet specific customized requirements, such as
+understanding the underlying reasons for a user's likes or dislikes regarding a
+video. This lack of granularity hinders our ability to tailor content
+recommendations effectively, as we cannot identify the particular attributes of
+content that resonate with individual users.
+
+
+
+ comment: 9 pages, 6 figures
+
+
+
+
+
+
+ ♻ ☆ Non-Random Data Encodes its Geometric and Topological Dimensions
+
+
+
+
+
+
+
+
+ Hector Zenil, Felipe S. Abrahão, Luan C. S. M. Ozelim
+
+
+ Based on the principles of information theory, measure theory, and
+theoretical computer science, we introduce a signal deconvolution method with a
+wide range of applications to coding theory, particularly in zero-knowledge
+one-way communication channels, such as in deciphering messages (i.e., objects
+embedded into multidimensional spaces) from unknown generating sources about
+which no prior knowledge is available and to which no return message can be
+sent. Our multidimensional space reconstruction method from an arbitrary
+received signal is proven to be agnostic vis-à-vis the encoding-decoding
+scheme, computation model, programming language, formal theory, the computable
+(or semi-computable) method of approximation to algorithmic complexity, and any
+arbitrarily chosen (computable) probability measure. The method derives from
+the principles of an approach to Artificial General Intelligence (AGI) capable
+of building a general-purpose model of models independent of any arbitrarily
+assumed prior probability distribution. We argue that this optimal and
+universal method of decoding non-random data has applications to signal
+processing, causal deconvolution, topological and geometric properties
+encoding, cryptography, and bio- and technosignature detection.
+
+
+
+ comment: arXiv:2303.16045 is based on this paper. arXiv admin note:
+ substantial text overlap with arXiv:2303.16045
+
+
+
+
+
+
+ ♻ ☆ On User-side Fairness in Negative Sampling for Recommender Systems
+
+
+ Recommender systems are usually trained to discern between positive and
+negative instances for each user. Negative sampling plays an important role in
+selecting informative negative items. Since positive data is disproportionately
+contributed by a minority of active users, negative samplers might be affected
+by data imbalance, thus choosing more informative negative items for active
+users. Consequently, users with low participation are further underrepresented
+in the training data, potentially causing subpar treatment from recommenders.
+In this paper, we demonstrate empirically that active users receive more
+accurate recommendations than inactive users for state-of-the-art negative
+sampling strategies, and the degree of data imbalance influences the severity
+of performance disparities. We further show that the performance gain brought
+by sampling more negative instances for each positive item is unequally
+distributed across user groups. Generally, active users benefit from
+performance gain whereas inactive users might suffer from performance
+degradation. To address these shortcomings, we propose a group-wise negative
+ratio setup where we use the appropriate smaller negative ratio for inactive
+users and a bigger ratio for active users. Comprehensive experiments show our
+proposed group-wise ratio outperforms a single global ratio in user-side
+fairness and performance improvement.
+
+
+
+
+
+
+
+
+
+
+ Multimedia 2
+
+
+
+
+
+ ☆ Music Genre Classification: Ensemble Learning with Subcomponents-level
+ Attention
+
+
+ Music Genre Classification is one of the most popular topics in the fields of
+Music Information Retrieval (MIR) and digital signal processing. Deep Learning
+has emerged as the top performer for classifying music genres among various
+methods. This letter introduces a novel approach that combines ensemble learning
+with attention to subcomponents, aiming to enhance the accuracy of identifying
+music genres. The core innovation of our work is the proposal to classify the
+subcomponents of the music pieces separately, allowing our model to capture
+distinct characteristics from those subcomponents. By applying ensemble
+learning techniques to these individual classifications, we make the final
+classification decision on the genre of the music. The proposed method achieves
+higher accuracy than other state-of-the-art techniques trained and tested on
+the GTZAN dataset.
+
+
+
+
+
+
+
+ ☆ PolySmart @ TRECVid 2024 Medical Video Question Answering
+
+
+ Video Corpus Visual Answer Localization (VCVAL) includes question-related
+video retrieval and visual answer localization in the videos. Specifically, we
+use text-to-text retrieval to find relevant videos for a medical question based
+on the similarity of video transcript and answers generated by GPT4. For the
+visual answer localization, the start and end timestamps of the answer are
+predicted by the alignments on both visual content and subtitles with queries.
+For the Query-Focused Instructional Step Captioning (QFISC) task, the step
+captions are generated by GPT4. Specifically, we provide the video captions
+generated by the LLaVA-Next-Video model and the video subtitles with timestamps
+as context, and ask GPT4 to generate step captions for the given medical query.
+We submit only one run for evaluation, and it obtains an F-score of 11.92 and
+a mean IoU of 9.6527.
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Information Retrieval 25
+
+
+
+
+
+ ☆ A Retrieval-Augmented Generation Framework for Academic Literature
+ Navigation in Data Science
+
+
+
+
+
+
+
+
+ Ahmet Yasin Aytar, Kemal Kilic, Kamer Kaya
+
+
+ In the rapidly evolving field of data science, efficiently navigating the
+expansive body of academic literature is crucial for informed decision-making
+and innovation. This paper presents an enhanced Retrieval-Augmented Generation
+(RAG) application, an artificial intelligence (AI)-based system designed to
+assist data scientists in accessing precise and contextually relevant academic
+resources. The AI-powered application integrates advanced techniques, including
+the GeneRation Of BIbliographic Data (GROBID) technique for extracting
+bibliographic information, fine-tuned embedding models, semantic chunking, and
+an abstract-first retrieval method, to significantly improve the relevance and
+accuracy of the retrieved information. This implementation of AI specifically
+addresses the challenge of academic literature navigation. A comprehensive
+evaluation using the Retrieval-Augmented Generation Assessment System (RAGAS)
+framework demonstrates substantial improvements in key metrics, particularly
+Context Relevance, underscoring the system's effectiveness in reducing
+information overload and enhancing decision-making processes. Our findings
+highlight the potential of this enhanced Retrieval-Augmented Generation system
+to transform academic exploration within data science, ultimately advancing the
+workflow of research and innovation in the field.
+
+
+
+
+
+
+
+
+ Austin Stone, Hagen Soltau, Robert Geirhos, Xi Yi, Ye Xia, Bingyi Cao, Kaifeng Chen, Abhijit Ogale, Jonathon Shlens
+
+
+ Visual imagery does not consist of solitary objects, but instead reflects the
+composition of a multitude of fluid concepts. While there have been great
+advances in visual representation learning, such advances have focused on
+building better representations for a small number of discrete objects bereft
+of an understanding of how these objects are interacting. One can observe this
+limitation in representations learned through captions or contrastive learning
+-- where the learned model treats an image essentially as a bag of words.
+Several works have attempted to address this limitation through the development
+of bespoke learned architectures to directly address the shortcomings in
+compositional learning. In this work, we focus on simple and scalable
+approaches. In particular, we demonstrate that by substantially improving
+weakly labeled data, i.e. captions, we can vastly improve the performance of
+standard contrastive learning approaches. Previous CLIP models achieved near
+chance rate on challenging tasks probing compositional learning. However, our
+simple approach boosts performance of CLIP substantially and surpasses all
+bespoke architectures. Furthermore, we showcase our results on a relatively new
+captioning benchmark derived from DOCCI. We demonstrate through a series of
+ablations that a standard CLIP model trained with enhanced data may demonstrate
+impressive performance on image retrieval tasks.
+
+
+
+
+
+
+
+ ☆ Nano-ESG: Extracting Corporate Sustainability Information from News
+ Articles ECIR 2025
+
+
+ Determining the sustainability impact of companies is a highly complex
+subject which has garnered more and more attention over the past few years.
+Today, investors largely rely on sustainability ratings from established
+rating providers in order to analyze how responsibly a company acts. However,
+those ratings have recently been criticized for being hard to understand and
+nearly impossible to reproduce.
+ An independent way to find out about the sustainability practices of
+companies lies in the rich landscape of news article data. In this paper, we
+explore a different approach to identify key opportunities and challenges of
+companies in the sustainability domain. We present a novel dataset of more than
+840,000 news articles which were gathered for major German companies between
+January 2023 and September 2024. By applying a mixture of Natural Language
+Processing techniques, we first identify relevant articles, before summarizing
+them and extracting their sustainability-related sentiment and aspect using
+Large Language Models (LLMs). Furthermore, we conduct an evaluation of the
+obtained data and determine that the LLM-produced answers are accurate. We
+release both datasets at https://github.com/Bailefan/Nano-ESG.
+
+
+
+ comment: To be published at ECIR 2025. Preprint
+
+
+
+
+
+
+
+ Rongqing Kenneth Ong, Andy W. H. Khong
+
+
+ Incorporating multi-modal features as side information has recently become a
+trend in recommender systems. To elucidate user-item preferences, recent
+studies focus on fusing modalities via concatenation, element-wise sum, or
+attention mechanisms. Despite having notable success, existing approaches do
+not account for the modality-specific noise encapsulated within each modality.
+As a result, direct fusion of modalities will lead to the amplification of
+cross-modality noise. Moreover, the variation of noise that is unique within
+each modality results in noise alleviation and fusion being more challenging.
+In this work, we propose a new Spectrum-based Modality Representation (SMORE)
+fusion graph recommender that aims to capture both uni-modal and fusion
+preferences while simultaneously suppressing modality noise. Specifically,
+SMORE projects the multi-modal features into the frequency domain and leverages
+the spectral space for fusion. To reduce dynamic contamination that is unique
+to each modality, we introduce a filter to attenuate and suppress the modality
+noise adaptively while capturing the universal modality patterns effectively.
+Furthermore, we explore the item latent structures by designing a new
+multi-modal graph learning module to capture associative semantic correlations
+and universal fusion patterns among similar items. Finally, we formulate a new
+modality-aware preference module, which infuses behavioral features and
+balances the uni- and multi-modal features for precise preference modeling.
+This empowers SMORE with the ability to infer both user modality-specific and
+fusion preferences more accurately. Experiments on three real-world datasets
+show the efficacy of our proposed model. The source code for this work has been
+made publicly available at https://github.com/kennethorq/SMORE.
+
+
+
+ comment: Accepted to ACM Web Search and Data Mining (WSDM) 2025
+
+ Recent advances in Information Retrieval have leveraged high-dimensional
+embedding spaces to improve the retrieval of relevant documents. Moreover, the
+Manifold Clustering Hypothesis suggests that despite these high-dimensional
+representations, documents relevant to a query reside on a lower-dimensional,
+query-dependent manifold. While this hypothesis has inspired new retrieval
+methods, existing approaches still face challenges in effectively separating
+non-relevant information from relevant signals. We propose a novel methodology
+that addresses these limitations by leveraging information from both relevant
+and non-relevant documents. Our method, ECLIPSE, computes a centroid based on
+irrelevant documents as a reference to estimate noisy dimensions present in
+relevant ones, enhancing retrieval performance. Extensive experiments on three
+in-domain and one out-of-domain benchmarks demonstrate an average improvement
+of up to 19.50% (resp. 22.35%) in mAP(AP) and 11.42% (resp. 13.10%) in nDCG@10
+w.r.t. the DIME-based baseline (resp. the baseline using all dimensions). Our
+results pave the way for more robust, pseudo-irrelevance-based retrieval
+systems in future IR research.
+
+
+
+
+
+
+
+ ☆ MRWeb: An Exploration of Generating Multi-Page Resource-Aware Web Code
+ from UI Designs
+
+
+
+
+
+
+
+
+ Yuxuan Wan, Yi Dong, Jingyu Xiao, Yintong Huo, Wenxuan Wang, Michael R. Lyu
+
+
+ Multi-page websites dominate modern web development. However, existing
+design-to-code methods rely on simplified assumptions, limiting them to single-page,
+self-contained webpages without external resource connections. To address this
+gap, we introduce the Multi-Page Resource-Aware Webpage (MRWeb) generation
+task, which transforms UI designs into multi-page, functional web UIs with
+internal/external navigation, image loading, and backend routing. We propose a
+novel resource list data structure to track resources, links, and design
+components. Our study applies existing methods to the MRWeb problem using a
+newly curated dataset of 500 websites (300 synthetic, 200 real-world).
+Specifically, we identify the best metric to evaluate the similarity of the web
+UI, assess the impact of the resource list on MRWeb generation, analyze MLLM
+limitations, and evaluate the effectiveness of the MRWeb tool in real-world
+workflows. The results show that resource lists boost navigation functionality
+from 0% to 66%-80% while facilitating visual similarity. Our proposed metrics
+and evaluation framework provide new insights into MLLM performance on MRWeb
+tasks. We release the MRWeb tool, dataset, and evaluation framework to promote
+further research.
+
+
+
+
+
+
+
+ ☆ ViFactCheck: A New Benchmark Dataset and Methods for Multi-domain News
+ Fact-Checking in Vietnamese AAAI'2025
+
+
+ The rapid spread of information in the digital age highlights the critical
+need for effective fact-checking tools, particularly for languages with limited
+resources, such as Vietnamese. In response to this challenge, we introduce
+ViFactCheck, the first publicly available benchmark dataset designed
+specifically for Vietnamese fact-checking across multiple online news domains.
+This dataset contains 7,232 human-annotated pairs of claim-evidence
+combinations sourced from reputable Vietnamese online news, covering 12 diverse
+topics. It has been subjected to a meticulous annotation process to ensure high
+quality and reliability, achieving a Fleiss Kappa inter-annotator agreement
+score of 0.83. Our evaluation leverages state-of-the-art pre-trained and large
+language models, employing fine-tuning and prompting techniques to assess
+performance. Notably, the Gemma model demonstrated superior effectiveness, with
+an impressive macro F1 score of 89.90%, thereby establishing a new standard for
+fact-checking benchmarks. This result highlights the robust capabilities of
+Gemma in accurately identifying and verifying facts in Vietnamese. To further
+promote advances in fact-checking technology and improve the reliability of
+digital media, we have made the ViFactCheck dataset, model checkpoints,
+fact-checking pipelines, and source code freely available on GitHub. This
+initiative aims to inspire further research and enhance the accuracy of
+information in low-resource languages.
+
+
+
+ comment: Accepted at AAAI'2025 Main Conference
+
+
+
+
+
+
+ ☆ Progressive Multimodal Reasoning via Active Retrieval
+
+
+ Multi-step multimodal reasoning tasks pose significant challenges for
+multimodal large language models (MLLMs), and finding effective ways to enhance
+their performance in such scenarios remains an unresolved issue. In this paper,
+we propose AR-MCTS, a universal framework designed to progressively improve the
+reasoning capabilities of MLLMs through Active Retrieval (AR) and Monte Carlo
+Tree Search (MCTS). Our approach begins with the development of a unified
+retrieval module that retrieves key supporting insights for solving complex
+reasoning problems from a hybrid-modal retrieval corpus. To bridge the gap in
+automated multimodal reasoning verification, we employ the MCTS algorithm
+combined with an active retrieval mechanism, which enables the automatic
+generation of step-wise annotations. This strategy dynamically retrieves key
+insights for each reasoning step, moving beyond traditional beam search
+sampling to improve the diversity and reliability of the reasoning space.
+Additionally, we introduce a process reward model that aligns progressively to
+support the automatic verification of multimodal reasoning tasks. Experimental
+results across three complex multimodal reasoning benchmarks confirm the
+effectiveness of the AR-MCTS framework in enhancing the performance of various
+multimodal models. Further analysis demonstrates that AR-MCTS can optimize
+sampling diversity and accuracy, yielding reliable multimodal reasoning.
+
+
+
+ comment: Work in progress
+
+
+
+
+
+
+ ☆ Sliding Windows Are Not the End: Exploring Full Ranking with
+ Long-Context Large Language Models
+
+
+ Large Language Models (LLMs) have shown exciting performance in listwise
+passage ranking. Due to the limited input length, existing methods often adopt
+the sliding window strategy. Such a strategy, though effective, is inefficient
+as it involves repetitive and serialized processing, which usually re-evaluates
+relevant passages multiple times. As a result, it incurs redundant API costs,
+which are proportional to the number of inference tokens. The development of
+long-context LLMs enables the full ranking of all passages within a single
+inference, avoiding redundant API costs. In this paper, we conduct a
+comprehensive study of long-context LLMs for ranking tasks in terms of
+efficiency and effectiveness. Surprisingly, our experiments reveal that full
+ranking with long-context LLMs can deliver superior performance in the
+supervised fine-tuning setting with a huge efficiency improvement. Furthermore,
+we identify two limitations of fine-tuning the full ranking model based on
+existing methods: (1) the sliding window strategy fails to produce a full ranking
+list as a training label, and (2) the language modeling loss cannot emphasize
+top-ranked passage IDs in the label. To alleviate these issues, we propose a
+new complete listwise label construction approach and a novel importance-aware
+learning objective for full ranking. Experiments show the superior performance
+of our method over baselines. Our code is available at
+https://github.com/8421BCD/fullrank.
+
+
+
+ comment: 14 pages
+
+
+
+
+
+
+ ☆ Efficient Self-Supervised Video Hashing with Selective State Spaces AAAI'25
+
+
+
+
+
+
+
+
+ Jinpeng Wang, Niu Lian, Jun Li, Yuting Wang, Yan Feng, Bin Chen, Yongbing Zhang, Shu-Tao Xia
+
+
+ Self-supervised video hashing (SSVH) is a practical task in video indexing
+and retrieval. Although Transformers are predominant in SSVH for their
+impressive temporal modeling capabilities, they often suffer from computational
+and memory inefficiencies. Drawing inspiration from Mamba, an advanced
+state-space model, we explore its potential in SSVH to achieve a better balance
+between efficacy and efficiency. We introduce S5VH, a Mamba-based video hashing
+model with an improved self-supervised learning paradigm. Specifically, we
+design bidirectional Mamba layers for both the encoder and decoder, which are
+effective and efficient in capturing temporal relationships thanks to the
+data-dependent selective scanning mechanism with linear complexity. In our
+learning strategy, we transform global semantics in the feature space into
+semantically consistent and discriminative hash centers, followed by a center
+alignment loss as a global learning signal. Our self-local-global (SLG)
+paradigm significantly improves learning efficiency, leading to faster and
+better convergence. Extensive experiments demonstrate S5VH's improvements over
+state-of-the-art methods, superior transferability, and scalable advantages in
+inference efficiency. Code is available at
+https://github.com/gimpong/AAAI25-S5VH.
+
+
+ Social media constitutes a rich and influential source of information for
+qualitative researchers. Although computational techniques like topic modelling
+assist with managing the volume and diversity of social media content,
+qualitative researchers' lack of programming expertise creates a significant
+barrier to their adoption. In this paper we explore how BERTopic, an advanced
+Large Language Model (LLM)-based topic modelling technique, can support
+qualitative data analysis of social media. We conducted interviews and hands-on
+evaluations in which qualitative researchers compared topics from three
+modelling techniques: LDA, NMF, and BERTopic. BERTopic was favoured by 8 of 12
+participants for its ability to provide detailed, coherent clusters for deeper
+understanding and actionable insights. Participants also prioritised topic
+relevance, logical organisation, and the capacity to reveal unexpected
+relationships within the data. Our findings underscore the potential of
+LLM-based techniques for supporting qualitative analysis.
+
+
+ Multi-behavior recommendation (MBR) has garnered growing attention recently
+due to its ability to mitigate the sparsity issue by inferring user preferences
+from various auxiliary behaviors to improve predictions for the target
+behavior. Although existing research on MBR has yielded impressive results,
+it still faces two major limitations. First, previous methods mainly focus on
+modeling fine-grained interaction information between users and items under
+each behavior, which may suffer from the sparsity issue. Second, existing models
+usually concentrate on exploiting dependencies between two consecutive
+behaviors, leaving intra- and inter-behavior consistency largely unexplored. To
+this end, we propose a novel approach named Hypergraph Enhanced Cascading Graph
+Convolution Network for multi-behavior recommendation (HEC-GCN). To be
+specific, we first explore both fine- and coarse-grained correlations among
+users or items of each behavior by simultaneously modeling the
+behavior-specific interaction graph and its corresponding hypergraph in a
+cascaded manner. Then, we propose a behavior consistency-guided alignment
+strategy that ensures consistent representations between the interaction graph
+and its associated hypergraph for each behavior, while also maintaining
+representation consistency across different behaviors. Extensive experiments
+and analyses on three public benchmark datasets demonstrate that our proposed
+approach is consistently superior to previous state-of-the-art methods due to
+its capability to effectively attenuate the sparsity issue as well as preserve
+both intra- and inter-behavior consistencies. The code is available at
+https://github.com/marqu22/HEC-GCN.git.
+
+
+
+
+
+
+
+
+ Xueguang Ma, Shengyao Zhuang, Bevan Koopman, Guido Zuccon, Wenhu Chen, Jimmy Lin
+
+
+ Generation with source attribution is important for enhancing the
+verifiability of retrieval-augmented generation (RAG) systems. However,
+existing approaches in RAG primarily link generated content to document-level
+references, making it challenging for users to locate evidence among multiple
+content-rich retrieved documents. To address this challenge, we propose
+Retrieval-Augmented Generation with Visual Source Attribution (VISA), a novel
+approach that combines answer generation with visual source attribution.
+Leveraging large vision-language models (VLMs), VISA identifies the evidence
+and highlights the exact regions that support the generated answers with
+bounding boxes in the retrieved document screenshots. To evaluate its
+effectiveness, we curated two datasets: Wiki-VISA, based on crawled Wikipedia
+webpage screenshots, and Paper-VISA, derived from PubLayNet and tailored to the
+medical domain. Experimental results demonstrate the effectiveness of VISA for
+visual source attribution on documents' original look, as well as highlighting
+the challenges for improvement. Code, data, and model checkpoints will be
+released.
+
+
+
+
+
+
+
+ ☆ Are Longer Prompts Always Better? Prompt Selection in Large Language
+ Models for Recommendation Systems
+
+
+ In large language model (LLM)-based recommendation systems (LLM-RSs),
+accurately predicting user preferences by leveraging the general knowledge of
+LLMs is possible without requiring extensive training data. By converting
+recommendation tasks into natural language inputs called prompts, LLM-RSs can
+efficiently solve issues that have been difficult to address due to data
+scarcity but are crucial in applications such as cold-start and cross-domain
+problems. However, when applying this in practice, selecting the prompt that
+matches tasks and data is essential. Although numerous prompts have been
+proposed in LLM-RSs and representing the target user in prompts significantly
+impacts recommendation accuracy, there are still no clear guidelines for
+selecting specific prompts.
+ In this paper, we categorize and analyze prompts from previous research to
+establish practical prompt selection guidelines. Through 450 experiments with
+90 prompts and five real-world datasets, we examined the relationship between
+prompts and dataset characteristics in recommendation accuracy. We found that
+no single prompt consistently outperforms others; thus, selecting prompts on
+the basis of dataset characteristics is crucial. Here, we propose a prompt
+selection method that achieves higher accuracy with minimal validation data.
+Because increasing the number of prompts to explore raises costs, we also
+introduce a cost-efficient strategy using high-performance and cost-efficient
+LLMs, significantly reducing exploration costs while maintaining high
+prediction accuracy. Our work offers valuable insights into prompt selection,
+advancing accurate and efficient LLM-RSs.
+
+
+
+ comment: 15 pages
+
+
+
+
+
+
+ ♻ ☆ ScopeQA: A Framework for Generating Out-of-Scope Questions for RAG
+
+
+ Conversational AI agents use Retrieval Augmented Generation (RAG) to provide
+verifiable document-grounded responses to user inquiries. However, many natural
+questions do not have good answers: about 25% contain false assumptions
+(Yu et al., 2023), and over 50% are ambiguous (Min et al., 2020). RAG agents
+need high-quality data to
+improve their responses to confusing questions. This paper presents a novel
+guided hallucination-based method to efficiently generate a diverse set of
+borderline out-of-scope confusing questions for a given document corpus. We
+conduct an empirical comparative evaluation of several large language models as
+RAG agents to measure the accuracy of confusion detection and appropriate
+response generation. We contribute a benchmark dataset to the public domain.
+
+
+
+ comment: under review
+
+
+
+
+
+
+ ♻ ☆ Metric Compatible Training for Online Backfilling in Large-Scale
+ Retrieval
+
+
+
+
+
+
+
+
+ Seonguk Seo, Mustafa Gokhan Uzunbas, Bohyung Han, Sara Cao, Ser-Nam Lim
+
+
+ Backfilling is the process of re-extracting all gallery embeddings from
+upgraded models in image retrieval systems. It inevitably incurs a
+prohibitively large computational cost and even entails the downtime
+of the service. Although backward-compatible learning sidesteps this challenge
+by tackling query-side representations, this leads to suboptimal solutions in
+principle because gallery embeddings cannot benefit from model upgrades. We
+address this dilemma by introducing an online backfilling algorithm, which
+enables us to achieve a progressive performance improvement during the
+backfilling process while not sacrificing the final performance of the new model
+after the completion of backfilling. To this end, we first propose a simple
+distance rank merge technique for online backfilling. Then, we incorporate a
+reverse transformation module for more effective and efficient merging, which
+is further enhanced by adopting a metric-compatible contrastive learning
+approach. These two components help to make the distances of old and new models
+compatible, resulting in desirable merge results during backfilling with no
+extra computational overhead. Extensive experiments show the effectiveness of
+our framework on four standard benchmarks in various settings.
+
+
+ Ontology matching (OM) enables semantic interoperability between different
+ontologies and resolves their conceptual heterogeneity by aligning related
+entities. OM systems currently have two prevailing design paradigms:
+conventional knowledge-based expert systems and newer machine learning-based
+predictive systems. While large language models (LLMs) and LLM agents have
+revolutionised data engineering and have been applied creatively in many
+domains, their potential for OM remains underexplored. This study introduces a
+novel agent-powered LLM-based design paradigm for OM systems. With
+consideration of several specific challenges in leveraging LLM agents for OM,
+we propose a generic framework, namely Agent-OM (Agent for Ontology Matching),
+consisting of two Siamese agents for retrieval and matching, with a set of
+simple OM tools. Our framework is implemented in a proof-of-concept system.
+Evaluations on three Ontology Alignment Evaluation Initiative (OAEI) tracks
+against state-of-the-art OM systems show that our system can achieve results very
+close to the long-standing best performance on simple OM tasks and can
+significantly improve the performance on complex and few-shot OM tasks.
+
+
+
+ comment: 19 pages, 13 figures, 4 tables
+
+
+
+
+
+
+ ♻ ☆ DNS-Rec: Data-aware Neural Architecture Search for Recommender Systems
+
+
+ In the era of data proliferation, efficiently sifting through vast
+information to extract meaningful insights has become increasingly crucial.
+This paper addresses the computational overhead and resource inefficiency
+prevalent in existing Sequential Recommender Systems (SRSs). We introduce an
+innovative approach combining pruning methods with advanced model designs.
+Furthermore, we delve into resource-constrained Neural Architecture Search
+(NAS), an emerging technique in recommender systems, to optimize models in
+terms of FLOPs, latency, and energy consumption while maintaining or enhancing
+accuracy. Our principal contribution is the development of a Data-aware Neural
+Architecture Search for Recommender System (DNS-Rec). DNS-Rec is specifically
+designed to tailor compact network architectures for attention-based SRS
+models, thereby ensuring accuracy retention. It incorporates data-aware gates
+to enhance the performance of the recommendation network by learning
+information from historical user-item interactions. Moreover, DNS-Rec employs a
+dynamic resource constraint strategy, stabilizing the search process and
+yielding more suitable architectural solutions. We demonstrate the
+effectiveness of our approach through rigorous experiments conducted on three
+benchmark datasets, which highlight the superiority of DNS-Rec in SRSs. Our
+findings set a new standard for future research in efficient and accurate
+recommendation systems, marking a significant step forward in this rapidly
+evolving field.
+
+
+
+
+
+
+
+ ♻ ☆ Probability Distribution Learning and Its Application in Deep Learning
+
+
+ This paper introduces a novel theoretical learning framework, termed
+probability distribution learning (PD learning). Departing from the traditional
+statistical learning framework, PD learning focuses on learning the underlying
+probability distribution, which is modeled as a random variable within the
+probability simplex. In this framework, the optimization objective is the
+learning error, which quantifies the posterior expected discrepancy between the
+model's predicted distribution and the underlying true distribution, given
+available sample data and prior knowledge. To optimize the learning error, this
+paper proposes the necessary conditions for loss functions, models, and
+optimization algorithms, ensuring that these conditions are met in real-world
+machine learning scenarios. Based on these conditions, the non-convex
+optimization mechanism corresponding to model training can be theoretically
+resolved. Moreover, this paper provides model-dependent and model-independent
+bounds on learning error, offering new insights into the model's fitting and
+generalization capabilities. Furthermore, the paper applies the PD learning
+framework to elucidate the mechanisms by which various techniques, including
+random parameter initialization, over-parameterization, and dropout, influence
+deep model training. Finally, the paper substantiates the key conclusions of
+the proposed framework through experimental results.
+
+
+
+ comment: arXiv admin note: text overlap with arXiv:2105.04026 by other
+ authors
+
+
+
+
+
+
+ ♻ ☆ Lightning IR: Straightforward Fine-tuning and Inference of
+ Transformer-based Language Models for Information Retrieval WSDM'25
+
+
+ A wide range of transformer-based language models have been proposed for
+information retrieval tasks. However, including transformer-based models in
+retrieval pipelines is often complex and requires substantial engineering
+effort. In this paper, we introduce Lightning IR, an easy-to-use PyTorch
+Lightning-based framework for applying transformer-based language models in
+retrieval scenarios. Lightning IR provides a modular and extensible
+architecture that supports all stages of a retrieval pipeline: from fine-tuning
+and indexing to searching and re-ranking. Designed to be scalable and
+reproducible, Lightning IR is available as open-source:
+https://github.com/webis-de/lightning-ir.
+
+
+
+
+
+
+
+
+ Jin-Yu Liu, Xian-Ling Mao, Tian-Yi Che, Rong-Cheng Tu
+
+
+ Multi-modal hashing methods have gained popularity due to their fast speed
+and low storage requirements. Among them, the supervised methods demonstrate
+better performance by utilizing labels as supervisory signals compared with
+unsupervised methods. Currently, for almost all supervised multi-modal hashing
+methods, there is a hidden assumption that training sets have no noisy labels.
+However, labels are often annotated incorrectly due to manual labeling in
+real-world scenarios, which will greatly harm the retrieval performance. To
+address this issue, we first discover a significant distribution consistency
+pattern through experiments, i.e., the 1-0 distribution of the presence or
+absence of each category in the label is consistent with the high-low
+distribution of similarity scores of the hash codes relative to category
+centers. Then, inspired by this pattern, we propose a novel
+Distribution-Consistency-Guided Multi-modal Hashing (DCGMH), which aims to
+filter and reconstruct noisy labels to enhance retrieval performance.
+Specifically, the proposed method first randomly initializes several category
+centers, which are used to compute the high-low distribution of similarity
+scores; noisy and clean labels are then separately filtered out via the
+discovered distribution consistency pattern to mitigate the impact of noisy
+labels; subsequently, a correction strategy, which is indirectly designed via
+the distribution consistency pattern, is applied to the filtered noisy labels,
+correcting high-confidence ones while treating low-confidence ones as unlabeled
+for unsupervised learning, thereby further enhancing the model's performance.
+Extensive experiments on three widely used datasets demonstrate the superiority
+of the proposed method compared to state-of-the-art baselines in multi-modal
+retrieval tasks. The code is available at
+https://github.com/LiuJinyu1229/DCGMH.
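+
+ The distribution consistency check described in this abstract can be
+illustrated with a minimal sketch. It is an editorial illustration, not the
+authors' implementation: the {-1, +1} hash codes, the fixed category centers,
+the normalized dot-product similarity, and the zero threshold are all
+assumptions chosen to keep the example small.

```python
def similarity(code, center):
    """Normalized dot product between a hash code and a category center.

    Both are assumed to be sequences of -1/+1 values of equal length.
    """
    return sum(c * z for c, z in zip(code, center)) / len(code)


def is_label_consistent(code, centers, label, threshold=0.0):
    """Return True when the 1-0 label agrees with the high-low similarities.

    A category marked 1 should have a similarity above the (assumed)
    threshold, and a category marked 0 should not.
    """
    for center, present in zip(centers, label):
        high = similarity(code, center) > threshold
        if high != bool(present):
            return False
    return True


# Hypothetical 4-bit codes and two category centers.
centers = [[1, 1, 1, 1], [-1, -1, 1, 1]]
code = [1, 1, 1, -1]

clean = is_label_consistent(code, centers, label=[1, 0])
noisy = is_label_consistent(code, centers, label=[0, 1])
```

+ Under these assumptions, a sample whose label fails the check would be
+treated as a noisy-label candidate and passed on to a correction step.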
+
+
+
+
+
+
+
+ ♻ ☆ DLCRec: A Novel Approach for Managing Diversity in LLM-Based Recommender
+ Systems WSDM 2025
+
+
+ The integration of Large Language Models (LLMs) into recommender systems has
+led to substantial performance improvements. However, this often comes at the
+cost of diminished recommendation diversity, which can negatively impact user
+satisfaction. To address this issue, controllable recommendation has emerged as
+a promising approach, allowing users to specify their preferences and receive
+recommendations that meet their diverse needs. Despite its potential, existing
+controllable recommender systems frequently rely on simplistic mechanisms, such
+as a single prompt, to regulate diversity, an approach that falls short of
+capturing the full complexity of user preferences. In response to these
+limitations, we propose DLCRec, a novel framework designed to enable
+fine-grained control over diversity in LLM-based recommendations. Unlike
+traditional methods, DLCRec adopts a fine-grained task decomposition strategy,
+breaking down the recommendation process into three sequential sub-tasks: genre
+prediction, genre filling, and item prediction. These sub-tasks are trained
+independently and inferred sequentially according to user-defined control
+numbers, ensuring more precise control over diversity. Furthermore, the
+scarcity and uneven distribution of diversity-related user behavior data pose
+significant challenges for fine-tuning. To overcome these obstacles, we
+introduce two data augmentation techniques that enhance the model's robustness
+to noisy and out-of-distribution data. These techniques expose the model to a
+broader range of patterns, improving its adaptability in generating
+recommendations with varying levels of diversity. Our extensive empirical
+evaluation demonstrates that DLCRec not only provides precise control over
+diversity but also outperforms state-of-the-art baselines across multiple
+recommendation scenarios.
+
+
+
+ comment: Accepted by WSDM 2025
+
+
+
+
+
+
+ ♻ ☆ SCONE: A Novel Stochastic Sampling to Generate Contrastive Views and
+ Hard Negative Samples for Recommendation WSDM 2025
+
+
+ Graph-based collaborative filtering (CF) has emerged as a promising approach
+in recommender systems. Despite its achievements, graph-based CF models face
+challenges due to data sparsity and negative sampling. In this paper, we
+propose a novel Stochastic sampling for i) COntrastive views and ii) hard
+NEgative samples (SCONE) to overcome these issues. SCONE generates dynamic
+augmented views and diverse hard negative samples via a unified stochastic
+sampling approach based on score-based generative models. Our extensive
+experiments on 6 benchmark datasets show that SCONE consistently outperforms
+state-of-the-art baselines. SCONE shows efficacy in addressing user sparsity
+and item popularity issues, while enhancing performance for both cold-start
+users and long-tail items. Furthermore, our approach improves the diversity of
+the recommendation and the uniformity of the representations. The code is
+available at https://github.com/jeongwhanchoi/SCONE.
+
+
+
+ comment: Accepted to WSDM 2025. Chaejeong Lee and Jeongwhan Choi are co-first
+ authors with equal contributions
+
+
+
+
+
+
+ ♻ ☆ WISE: Rethinking the Knowledge Memory for Lifelong Model Editing of
+ Large Language Models NeurIPS 2024
+
+
+ Large language models (LLMs) need knowledge updates to keep pace with
+ever-growing world facts and to correct hallucinated responses, which motivates
+methods for lifelong model editing. Where the updated knowledge resides in memories is a
+fundamental question for model editing. In this paper, we find that editing
+either long-term memory (direct model parameters) or working memory
+(non-parametric knowledge of neural network activations/representations by
+retrieval) will result in an impossible triangle -- reliability,
+generalization, and locality cannot be realized together in the lifelong
+editing settings. For long-term memory, directly editing the parameters will
+cause conflicts with irrelevant pretrained knowledge or previous edits (poor
+reliability and locality). For working memory, retrieval-based activations can
+hardly make the model understand the edits and generalize (poor
+generalization). Therefore, we propose WISE to bridge the gap between memories.
+In WISE, we design a dual parametric memory scheme, which consists of the main
+memory for the pretrained knowledge and a side memory for the edited knowledge.
+We only edit the knowledge in the side memory and train a router to decide
+which memory to go through when given a query. For continual editing, we devise
+a knowledge-sharding mechanism where different sets of edits reside in distinct
+subspaces of parameters, and are subsequently merged into a shared memory
+without conflicts. Extensive experiments show that WISE can outperform previous
+model editing methods and overcome the impossible triangle under lifelong model
+editing of question answering, hallucination, and out-of-distribution settings
+across trending LLM architectures, e.g., GPT, LLaMA, and Mistral. Code is
+available at https://github.com/zjunlp/EasyEdit.
+
+
+ The remarkable capabilities of modern large language models are rooted in
+their vast repositories of knowledge encoded within their parameters, enabling
+them to perceive the world and engage in reasoning. The inner workings of how
+these models store knowledge have long been a subject of intense interest and
+investigation among researchers. To date, most studies have concentrated on
+isolated components within these models, such as Multilayer Perceptrons and
+attention heads. In this paper, we delve into the computation graph of the
+language model to uncover the knowledge circuits that are instrumental in
+articulating specific knowledge. The experiments, conducted with GPT2 and
+TinyLLAMA, have allowed us to observe how certain information heads, relation
+heads, and Multilayer Perceptrons collaboratively encode knowledge within the
+model. Moreover, we evaluate the impact of current knowledge editing techniques
+on these knowledge circuits, providing deeper insights into the functioning and
+constraints of these editing methodologies. Finally, we utilize knowledge
+circuits to analyze and interpret language model behaviors such as
+hallucinations and in-context learning. We believe the knowledge circuits hold
+potential for advancing our understanding of Transformers and guiding the
+improved design of knowledge editing. Code and data are available at
+https://github.com/zjunlp/KnowledgeCircuits.
+
+
+
+ comment: NeurIPS 2024, 26 pages
+
+
+
+
+
+
+
+
+
+ Multimedia 7
+
+
+
+
+
+ ☆ Prompt-A-Video: Prompt Your Video Diffusion Model via Preference-Aligned
+ LLM
+
+
+ Text-to-video models have made remarkable advancements through optimization
+on high-quality text-video pairs, where the textual prompts play a pivotal role
+in determining the quality of output videos. However, achieving the desired output
+often entails multiple revisions and iterative inference to refine
+user-provided prompts. Current automatic methods for refining prompts encounter
+challenges such as Modality-Inconsistency, Cost-Discrepancy, and Model-Unaware
+when applied to text-to-video diffusion models. To address these problems, we
+introduce an LLM-based prompt adaptation framework, termed Prompt-A-Video,
+which excels in crafting Video-Centric, Labor-Free and Preference-Aligned
+prompts tailored to a specific video diffusion model. Our approach involves a
+meticulously crafted two-stage optimization and alignment system. Initially, we
+conduct a reward-guided prompt evolution pipeline to automatically create
+an optimal prompt pool and leverage it for supervised fine-tuning (SFT) of the
+LLM. Then multi-dimensional rewards are employed to generate pairwise data for
+the SFT model, followed by the direct preference optimization (DPO) algorithm
+to further facilitate preference alignment. Through extensive experimentation
+and comparative analyses, we validate the effectiveness of Prompt-A-Video
+across diverse generation models, highlighting its potential to push the
+boundaries of video generation.
+
+
+
+
+
+
+
+ ☆ Stable-V2A: Synthesis of Synchronized Sound Effects with Temporal and
+ Semantic Controls
+
+
+
+
+
+
+
+
+ Riccardo Fosco Gramaccioni, Christian Marinoni, Emilian Postolache, Marco Comunità, Luca Cosmo, Joshua D. Reiss, Danilo Comminiello
+
+
+ Sound designers and Foley artists usually sonorize a scene, such as from a
+movie or video game, by manually annotating and sonorizing each action of
+interest in the video. In our case, the intent is to leave full creative
+control to sound designers with a tool that allows them to bypass the more
+repetitive parts of their work, thus being able to focus on the creative
+aspects of sound production. We achieve this by presenting Stable-V2A, a two-stage
+model consisting of: an RMS-Mapper that estimates an envelope representative of
+the audio characteristics associated with the input video; and Stable-Foley, a
+diffusion model based on Stable Audio Open that generates audio semantically
+and temporally aligned with the target video. Temporal alignment is guaranteed
+by the use of the envelope as a ControlNet input, while semantic alignment is
+achieved through the use of sound representations chosen by the designer as
+cross-attention conditioning of the diffusion process. We train and test our
+model on Greatest Hits, a dataset commonly used to evaluate V2A models. In
+addition, to test our model on a case study of interest, we introduce Walking
+The Maps, a dataset of videos extracted from video games depicting animated
+characters walking in different locations. Samples and code are available on our
+demo page at https://ispamm.github.io/Stable-V2A.
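+
+ To make the notion of an envelope concrete, the sketch below computes a
+frame-wise root-mean-square (RMS) envelope from raw audio samples. This is an
+editorial illustration only: the abstract does not specify how the RMS-Mapper
+is built or trained, and the function name, frame size, and hop size here are
+assumptions. It shows the kind of coarse loudness-over-time signal that could
+serve as a temporal control.

```python
import math


def rms_envelope(samples, frame_size, hop_size):
    """Return the RMS value of each full frame of the signal.

    Frames start every `hop_size` samples; a trailing partial frame is dropped.
    """
    envelope = []
    for start in range(0, len(samples) - frame_size + 1, hop_size):
        frame = samples[start:start + frame_size]
        mean_square = sum(s * s for s in frame) / frame_size
        envelope.append(math.sqrt(mean_square))
    return envelope


# Hypothetical signal: silence, a short burst, then silence again.
signal = [0.0] * 4 + [1.0, -1.0, 1.0, -1.0] + [0.0] * 4
envelope = rms_envelope(signal, frame_size=4, hop_size=4)
```

+ The envelope is high only where the burst occurs, which is the timing
+information a conditioning signal of this kind would carry.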
+
+
+
+
+
+
+
+
+ Rongqing Kenneth Ong, Andy W. H. Khong
+
+
+ Incorporating multi-modal features as side information has recently become a
+trend in recommender systems. To elucidate user-item preferences, recent
+studies focus on fusing modalities via concatenation, element-wise sum, or
+attention mechanisms. Despite having notable success, existing approaches do
+not account for the modality-specific noise encapsulated within each modality.
+As a result, direct fusion of modalities will lead to the amplification of
+cross-modality noise. Moreover, the variation of noise that is unique within
+each modality results in noise alleviation and fusion being more challenging.
+In this work, we propose a new Spectrum-based Modality Representation (SMORE)
+fusion graph recommender that aims to capture both uni-modal and fusion
+preferences while simultaneously suppressing modality noise. Specifically,
+SMORE projects the multi-modal features into the frequency domain and leverages
+the spectral space for fusion. To reduce dynamic contamination that is unique
+to each modality, we introduce a filter to attenuate and suppress the modality
+noise adaptively while capturing the universal modality patterns effectively.
+Furthermore, we explore the item latent structures by designing a new
+multi-modal graph learning module to capture associative semantic correlations
+and universal fusion patterns among similar items. Finally, we formulate a new
+modality-aware preference module, which infuses behavioral features and
+balances the uni- and multi-modal features for precise preference modeling.
+This empowers SMORE with the ability to infer both user modality-specific and
+fusion preferences more accurately. Experiments on three real-world datasets
+show the efficacy of our proposed model. The source code for this work has been
+made publicly available at https://github.com/kennethorq/SMORE.
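+
+ The idea of projecting modality features into the frequency domain, filtering
+them, and fusing them there can be sketched as follows. This is an editorial
+illustration under stated assumptions, not the SMORE implementation: the
+filters are fixed per-frequency weights rather than learned ones, fusion is a
+plain average of the filtered spectra, and the names `filter_and_fuse`,
+`visual`, and `textual` are hypothetical.

```python
import cmath


def dft(signal):
    """Discrete Fourier transform of a sequence (naive O(n^2) version)."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def inverse_dft(spectrum):
    """Inverse transform; keeps the real part of each reconstructed value."""
    n = len(spectrum)
    return [
        (sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n).real
        for t in range(n)
    ]


def filter_and_fuse(modal_features, modal_filters):
    """Filter each modality's spectrum, average the results, and invert.

    Averaging is an assumed fusion rule chosen for simplicity.
    """
    n = len(modal_features[0])
    fused = [0j] * n
    for feature, weights in zip(modal_features, modal_filters):
        spectrum = dft(feature)
        for k in range(n):
            fused[k] += weights[k] * spectrum[k] / len(modal_features)
    return inverse_dft(fused)


# Hypothetical 4-dimensional item features for two modalities.
visual = [1.0, 2.0, 3.0, 4.0]
textual = [0.0, 0.0, 0.0, 4.0]
keep_dc_only = [1.0, 0.0, 0.0, 0.0]   # suppress every non-constant component
keep_all = [1.0, 1.0, 1.0, 1.0]       # identity filter

smoothed = filter_and_fuse([visual, textual], [keep_dc_only, keep_dc_only])
unfiltered = filter_and_fuse([visual, textual], [keep_all, keep_all])
```

+ With the identity filter the result is simply the average of the two feature
+vectors; with the constant-only filter every high-frequency component is
+removed and only the shared mean level survives.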
+
+
+
+ comment: Accepted to ACM Web Search and Data Mining (WSDM) 2025
+
+
+
+
+
+
+ ☆ Efficient Self-Supervised Video Hashing with Selective State Spaces AAAI'25
+
+
+
+
+
+
+
+
+ Jinpeng Wang, Niu Lian, Jun Li, Yuting Wang, Yan Feng, Bin Chen, Yongbing Zhang, Shu-Tao Xia
+
+
+ Self-supervised video hashing (SSVH) is a practical task in video indexing
+and retrieval. Although Transformers are predominant in SSVH for their
+impressive temporal modeling capabilities, they often suffer from computational
+and memory inefficiencies. Drawing inspiration from Mamba, an advanced
+state-space model, we explore its potential in SSVH to achieve a better balance
+between efficacy and efficiency. We introduce S5VH, a Mamba-based video hashing
+model with an improved self-supervised learning paradigm. Specifically, we
+design bidirectional Mamba layers for both the encoder and decoder, which are
+effective and efficient in capturing temporal relationships thanks to the
+data-dependent selective scanning mechanism with linear complexity. In our
+learning strategy, we transform global semantics in the feature space into
+semantically consistent and discriminative hash centers, followed by a center
+alignment loss as a global learning signal. Our self-local-global (SLG)
+paradigm significantly improves learning efficiency, leading to faster and
+better convergence. Extensive experiments demonstrate S5VH's improvements over
+state-of-the-art methods, superior transferability, and scalable advantages in
+inference efficiency. Code is available at
+https://github.com/gimpong/AAAI25-S5VH.
+
+
+
+
+
+
+
+ ☆ Bridging the Data Provenance Gap Across Text, Speech and Video
+
+
+
+
+
+
+
+
+ Shayne Longpre, Nikhil Singh, Manuel Cherep, Kushagra Tiwary, Joanna Materzynska, William Brannon, Robert Mahari, Manan Dey, Mohammed Hamdy, Nayan Saxena, Ahmad Mustafa Anis, Emad A. Alghamdi, Vu Minh Chien, Naana Obeng-Marnu, Da Yin, Kun Qian, Yizhi Li, Minnie Liang, An Dinh, Shrestha Mohanty, Deividas Mataciunas, Tobin South, Jianguo Zhang, Ariel N. Lee, Campbell S. Lund, Christopher Klamm, Damien Sileo, Diganta Misra, Enrico Shippole, Kevin Klyman, Lester JV Miranda, Niklas Muennighoff, Seonghyeon Ye, Seungone Kim, Vipul Gupta, Vivek Sharma, Xuhui Zhou, Caiming Xiong, Luis Villa, Stella Biderman, Alex Pentland, Sara Hooker, Jad Kabbara
+
+
+ Progress in AI is driven largely by the scale and quality of training data.
+Despite this, there is a deficit of empirical analysis examining the attributes
+of well-established datasets beyond text. In this work we conduct the largest
+and first-of-its-kind longitudinal audit across modalities--popular text,
+speech, and video datasets--from their detailed sourcing trends and use
+restrictions to their geographical and linguistic representation. Our manual
+analysis covers nearly 4000 public datasets from 1990 to 2024, spanning 608
+languages, 798 sources, 659 organizations, and 67 countries. We find that
+multimodal machine learning applications have overwhelmingly turned to
+web-crawled, synthetic, and social media platforms, such as YouTube, for their
+training sets, eclipsing all other sources since 2019. Secondly, tracing the
+chain of dataset derivations we find that while less than 33% of datasets are
+restrictively licensed, over 80% of the source content in widely-used text,
+speech, and video datasets carries non-commercial restrictions. Finally, counter
+to the rising number of languages and geographies represented in public AI
+training datasets, our audit demonstrates measures of relative geographical and
+multilingual representation have failed to significantly improve their coverage
+since 2013. We believe the breadth of our audit enables us to empirically
+examine trends in data sourcing, restrictions, and Western-centricity at an
+ecosystem-level, and that visibility into these questions is essential to
+progress in responsible AI. As a contribution to ongoing improvements in
+dataset transparency and responsible use, we release our entire multimodal
+audit, allowing practitioners to trace data provenance across text, speech, and
+video.
+
+
+
+
+
+
+
+
+ Jinzheng Zhao, Yong Xu, Xinyuan Qian, Davide Berghi, Peipei Wu, Meng Cui, Jianyuan Sun, Philip J. B. Jackson, Wenwu Wang
+
+
+ Audio-visual speaker tracking has drawn increasing attention over the past
+few years due to its academic value and wide application. Audio and visual
+modalities can provide complementary information for localization and tracking.
+With audio and visual information, Bayesian filters can solve the
+problems of data association, audio-visual fusion and track management. In this
+paper, we present a comprehensive overview of audio-visual speaker tracking. To
+our knowledge, this is the first extensive survey over the past five years. We
+introduce the family of Bayesian filters and summarize the methods for
+obtaining audio-visual measurements. In addition, the existing trackers and
+their performance on the AV16.3 dataset are summarized. In the past few years, deep
+learning techniques have thrived, which has also boosted the development of
+audio-visual speaker tracking. The influence of deep learning techniques in terms of
+measurement extraction and state estimation is also discussed. Finally, we
+discuss the connections between audio-visual speaker tracking and other areas
+such as speech separation and distributed speaker tracking.
+
+
+
+
+
+
+
+ ♻ ☆ Sign-IDD: Iconicity Disentangled Diffusion for Sign Language Production AAAI 2025
+
+
+
+
+
+
+
+
+ Shengeng Tang, Jiayi He, Dan Guo, Yanyan Wei, Feng Li, Richang Hong
+
+
+ Sign Language Production (SLP) aims to generate semantically consistent sign
+videos from textual statements, where the conversion from textual glosses to
+sign poses (G2P) is a crucial step. Existing G2P methods typically treat sign
+poses as discrete three-dimensional coordinates and directly fit them, which
+overlooks the relative positional relationships among joints. To this end, we
+provide a new perspective, constraining joint associations and gesture details
+by modeling the limb bones to improve the accuracy and naturalness of the
+generated poses. In this work, we propose a pioneering iconicity disentangled
+diffusion framework, termed Sign-IDD, specifically designed for SLP. Sign-IDD
+incorporates a novel Iconicity Disentanglement (ID) module to bridge the gap
+between relative positions among joints. The ID module disentangles the
+conventional 3D joint representation into a 4D bone representation, comprising
+the 3D spatial direction vector and 1D spatial distance vector between adjacent
+joints. Additionally, an Attribute Controllable Diffusion (ACD) module is
+introduced to further constrain joint associations, in which the attribute
+separation layer aims to separate the bone direction and length attributes, and
+the attribute control layer is designed to guide the pose generation by
+leveraging the above attributes. The ACD module utilizes the gloss embeddings
+as semantic conditions and finally generates sign poses from noise embeddings.
+Extensive experiments on PHOENIX14T and USTC-CSL datasets validate the
+effectiveness of our method. The code is available at:
+https://github.com/NaVi-start/Sign-IDD.
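+
+ The conversion from a 3D joint representation to a 4D bone representation (a
+3D direction vector plus a 1D distance between adjacent joints) can be
+sketched as below. This is an editorial illustration, not the authors' code:
+the function name, the toy skeleton, and the handling of zero-length bones are
+assumptions.

```python
import math


def joints_to_bones(joints, bones):
    """Convert 3D joints into 4D bone features.

    joints: list of (x, y, z) coordinates.
    bones:  list of (parent_index, child_index) pairs of adjacent joints.
    Returns one (dx, dy, dz, length) tuple per bone, where (dx, dy, dz) is
    the unit direction from parent to child.
    """
    features = []
    for parent, child in bones:
        px, py, pz = joints[parent]
        cx, cy, cz = joints[child]
        dx, dy, dz = cx - px, cy - py, cz - pz
        length = math.sqrt(dx * dx + dy * dy + dz * dz)
        if length == 0.0:
            # Assumed convention: coincident joints get a zero direction.
            direction = (0.0, 0.0, 0.0)
        else:
            direction = (dx / length, dy / length, dz / length)
        features.append((*direction, length))
    return features


# Hypothetical three-joint chain (for example shoulder, elbow, wrist).
joints = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (3.0, 4.0, 2.0)]
bones = [(0, 1), (1, 2)]
bone_features = joints_to_bones(joints, bones)
```

+ Separating direction from length in this way lets the two attributes be
+constrained independently, which is the role the abstract assigns to the
+attribute separation layer.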
+
+