
🤖 AI Quick Overview

The key signal today is that AI agents are moving from “capable of execution” to “governable”: runtime strategies, active clarification, and multi-agent supervision are becoming research priorities. Meanwhile, AI programming tools are accelerating into the closed-loop …

📋 Article Metadata

Published: 2026-06-21
Type: ai-daily
Word count: 5673
Reading time: 27 min

2026-06-21 AI Daily | Agents Enter an Era of Governance, AI Programming Moves Towards Closed-Loop Execution

Today’s key signal is that AI agents are evolving from “capable of execution” to “governable,” with runtime policies, proactive clarification, and multi-agent supervision becoming research priorities. Meanwhile, AI programming tools are rapidly entering a phase of closed-loop execution and collaborative workflows. Products like Codex and Claude Code are starting to reshape the development process around task migration, action replication, and team-based visual delivery. Edge and small models are also demonstrating greater practical value in vertical-specific scenarios.

📖 Deep Dive: This Issue’s Watch List

The most noteworthy theme today is “agent governance.” Several papers, covering topics from runtime obligation/prohibition policies and implicit anchors in multi-agent deliberations to DeFi risk supervision and proactive clarification mechanisms, all point to a single issue: agents must not only know how to perform tasks but also when to stop, ask, and report.

The second theme is the growing focus on “LLM reliability assessment.” Research into cognitive blind spots in clinical tabular data, visualization of hidden biases, and classification of RTL hardware code failures is pushing evaluation beyond simple right-or-wrong results to the boundaries of uncertainty, bias, and generalization.

In model architecture, DeepSeek-V4’s million-token context, experimental analysis of diffusion language models, and the ITNet unified architecture are worth following for technical teams. They represent three different evolutionary directions: long context, non-autoregressive generation, and unification of fundamental operators.

🌐 AI Hot Topics on X

Topic 1: Loop Engineering Ushers in Autonomous AI Coding Era

Category: AI · News
Overview: Trending Time: , Related Posts: 42
What it is: The topic “Loop Engineering” has gained attention on X, focusing on enabling AI programming agents to come closer to autonomously completing software development tasks through continuous feedback, automated testing, and iterative execution.
Why it matters: This is seen as a key step in the evolution of AI programming from “assistant-level completion” to “autonomous engineering execution,” potentially transforming software development workflows, R&D efficiency, and the division of labor between developers and AI tools.
Discussion summary: The discussion on X centers on the reliability of autonomous coding agents. Supporters believe that closed-loop feedback and automated validation can significantly improve code quality, while critics worry about error accumulation in complex projects, security risks, accountability, and the impact on developer jobs.
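
The closed loop described above (propose, test, feed failures back, iterate) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not any product's implementation: `closed_loop`, `run_tests`, `toy_agent`, and `check_add` are hypothetical names, and the "agent" is a stub that only fixes its bug once the feedback mentions it.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical types: a candidate is source text; a check returns an error
# message on failure or None on success.
Check = Callable[[str], Optional[str]]
Proposer = Callable[[str, List[str]], str]


def run_tests(candidate: str, checks: List[Check]) -> List[str]:
    """Run every check and collect failure messages (empty list means all pass)."""
    failures = []
    for check in checks:
        message = check(candidate)
        if message is not None:
            failures.append(message)
    return failures


def closed_loop(task: str, propose: Proposer, checks: List[Check],
                max_iterations: int = 5) -> Tuple[Optional[str], int]:
    """Iterate propose -> test -> feed back failures until checks pass or the budget ends."""
    feedback: List[str] = []
    for iteration in range(1, max_iterations + 1):
        candidate = propose(task, feedback)
        feedback = run_tests(candidate, checks)
        if not feedback:
            return candidate, iteration
    return None, max_iterations  # budget exhausted: hand the task back to a human


def toy_agent(task: str, feedback: List[str]) -> str:
    """Stand-in for a coding agent: it fixes the bug only after feedback mentions it."""
    code = "def add(a, b): return a - b"
    if any("add" in msg for msg in feedback):
        code = "def add(a, b): return a + b"
    return code


def check_add(candidate: str) -> Optional[str]:
    namespace: dict = {}
    exec(candidate, namespace)  # acceptable for a toy; real agent output needs a sandbox
    return None if namespace["add"](2, 3) == 5 else "add(2, 3) should be 5"


result, rounds = closed_loop("implement add", toy_agent, [check_add])
```

The iteration cap is the piece that speaks to the critics' concern about error accumulation: when the budget runs out, the loop returns `None` rather than shipping an unverified candidate.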

Topic 2: Z.ai’s GLM-5.2 Tops Open-Weight AI Leaderboards

Category: AI · News
Overview: Trending Time: 6 hours ago, Related Posts: 3,800
What it is: Z.ai’s release of GLM-5.2 has achieved top rankings on several open-weight AI model leaderboards, attracting industry attention.
Why it matters: This indicates that open-source or open-weight models continue to close the gap with top-tier closed-source models in reasoning, coding, and general capabilities, potentially accelerating the adoption of locally deployable and customizable AI systems by enterprises and developers.
Discussion summary: Discussions on X focus on the true capabilities of GLM-5.2, the representativeness of its benchmarks, the gap between it and other open models like Llama, Qwen, and DeepSeek, and whether open-weight models will further erode the advantages of closed-source models.

Topic 3: Z.ai’s GLM-5.2 Tops Open Models, Matches Top Closed AIs in Coding

Category: AI · News
Overview: Trending Time: 2 days ago, Related Posts: 33,000
What it is: Z.ai’s GLM-5.2 is reported to have achieved leading performance among open models and to be approaching the coding capabilities of top-tier closed-source AIs.
Why it matters: If the benchmarks and real-world performance hold up, it means open models are further narrowing the gap with closed-source models in high-value scenarios like code generation and software engineering assistance. This could drive developer adoption, enterprise deployment, and competition in the model ecosystem.
Discussion summary: The discussion on X focuses on the reliability of GLM-5.2’s programming benchmarks, whether its real-world project performance can match the hype, the cost and controllability advantages of open models versus closed-source ones, and the rapid rise of Chinese AI companies in the open-source model race.

Topic 4: UC Berkeley’s PixelRAG Reads Web Pages from Screenshots

Category: AI · News
Overview: Trending Time: 5 hours ago, Related Posts: 326
What it is: A team from UC Berkeley has introduced PixelRAG, a multimodal RAG method that can directly read and retrieve information from webpage screenshots.
Why it matters: This shows that AI systems can bypass structured web text and understand page content based on the visual interface, which could enhance the capabilities of browser agents, web automation, and Q&A on complex interfaces.
Discussion summary: Discussions on X are focused on whether PixelRAG can bring AI closer to the human way of browsing the web and its practicality in web agents. There is also interest in the efficiency, accuracy, and scalability of screenshot-based retrieval and its advantages over traditional DOM/text retrieval methods.

Today’s AI Public Opinion Summary on X

Today’s public opinion primarily focuses on AI moving from “generating answers” to “executing tasks.” On one hand, Loop Engineering attempts to push programming agents towards more autonomous software engineering execution through feedback, testing, and iteration; on the other hand, PixelRAG lets agents read web pages from the visual interface, closer to the way humans do. The consensus is that open models and multimodal agent capabilities are rapidly approaching practical thresholds, with GLM-5.2’s performance reinforcing the judgment that open-weight models are narrowing the gap with closed-source models. Disagreements mainly revolve around whether benchmark capabilities equate to real-world capabilities: supporters value the efficiency gains from lower cost, controllability, local deployment, and automated verification, while skeptics worry about insufficient reliability in complex projects, real web pages, and long-running tasks. Potential risks include over-interpreting benchmarks, errors accumulating in automated closed loops, unclear safety and responsibility boundaries, and overly rapid reshaping of developer roles and enterprise technology roadmaps.

💡 Influencer Insights

Here is an industry analysis report based on tweet content from the past 24 hours:

1. Today’s Key Technology Trends and Product Hotspots Followed by Influencers

🔥 Hotspot One: AI Programming Enters the Deep End of “Full Automation” and “Collaborative Flow”

Today’s discussion focuses on AI programming evolving from “writing code” to “complete work delivery and collaboration.”

Cross-device Task Migration: Influencers are paying close attention to OpenAI Codex’s Handoff feature. @dotey points out that it goes beyond conversation syncing: it migrates the complete context, including uncommitted Git state, between local and cloud environments, so developers can keep agents working while commuting or away from their workstations.
Operation Replication and Automation: Codex’s Record & Replay is seen as a major step beyond RPA (Robotic Process Automation). @AI_Jasonyu called it a combination of “super version RPA + key macro + Computer Use”; @dotey believes it solves the pain point of “writing manuals being too troublesome”: demonstrate a tedious reimbursement or publishing process once, and the AI can generate reusable Skills. @vista8 also mentioned integrating Codex with ChatGPT via MCP, which gives a “double quota” and lets GPT-5.5 Pro handle top-level planning.
Visual Collaboration and Architecture: @dotey gave a detailed breakdown of Claude Code’s Artifacts feature. He believes it addresses the pain point that terminal session results were visible only to the operator: debugging timelines and system architecture descriptions become shared web pages that update in real time, which improves team collaboration.

🔥 Hotspot Two: Real-world Validation of Lightweight Models and Edge AI

Influencers are no longer merely discussing edge concepts; they are running in-depth practical comparisons and discussing implementation.

The “Sweet Spot” Battle for Edge Models: @zhixianio conducted a rigorous “ascetic” test, stating that Qwen3.6-35B-A3B’s response speed and “IQ” on Mac already surpass remote LLMs. At the same time, he conducted an in-depth test of the highly acclaimed Gemma 4 12B Coder in the community, finding that it performed significantly worse than 35B-level Qwen when faced with complex engineering tasks (such as Tetris, Three.js special effects), limited by its 12B parameter ceiling.
Explosion of Tiny Models and Application Diversification: @AI_Jasonyu observed the phenomenal performance of PP-OCRv6, a 1.5MB model whose browser-side recognition accuracy surpassed giants like GPT-5.5. He pointed out that for specific tasks with clear vertical boundaries, cleverly designed small models are reclaiming the “jobs” of large models, which also indirectly corroborates the importance of Google QAT (Quantization Aware Training) for edge devices, as mentioned by @zhixianio.
Edge Breakthroughs in Video and Voice: @zhixianio’s practical test of MiniCPM-o 4.5, a 9B multimodal model, showed considerable satisfaction with its audio and video full-duplex capabilities, indicating that small-parameter models’ multimodal interaction abilities are rapidly climbing.

🔥 Hotspot Three: Vibe Coding Paradigm Reflection and Toolchain Maturation

Discussions on development standardization are exceptionally lively.

From Vibe Coding to Contract First: @Pluvio9yte shared his journey from security practitioner to full-stack developer and proposed that the best practice for AI development is neither purely requirements-driven nor blindly code-driven, but “Contract First”: define contracts (interfaces, data models, and so on) in advance so that they serve as a stable reference for both humans and AI.
Return of Software Engineering Common Sense: @dotey responded systematically to the problem of unstable AI-generated code, emphasizing that requirements analysis, system design, code review, and gray (staged) release cannot be skipped in the AI era and matter even more. He reminded developers not to put everything into AGENTS.md, but to distinguish what belongs in rule documents from what should be enforced by automated tests.
The Game of AI Code Review: @dotey joked about “poisoning” open-source projects through prompt injection to phish for developers who submit PRs without reviewing the code, sparking a discussion on AI ethics and the necessity of human oversight.
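
As an illustration of the “Contract First” idea, the sketch below fixes a data model and an interface before any implementation exists, then checks an implementation against them. Everything here is a hypothetical example rather than @Pluvio9yte’s own code: `Expense`, `ExpenseStore`, `InMemoryExpenseStore`, and `check_contract` are made-up names.

```python
from dataclasses import dataclass
from typing import Dict, Protocol


@dataclass(frozen=True)
class Expense:
    """Data model agreed on before any implementation is written."""
    expense_id: str
    amount_cents: int
    currency: str


class ExpenseStore(Protocol):
    """Interface contract that any implementation, human- or AI-written, must match."""
    def save(self, expense: Expense) -> None: ...
    def get(self, expense_id: str) -> Expense: ...


class InMemoryExpenseStore:
    """One possible implementation; it can be regenerated as long as the contract holds."""
    def __init__(self) -> None:
        self._items: Dict[str, Expense] = {}

    def save(self, expense: Expense) -> None:
        if expense.amount_cents < 0:
            raise ValueError("amount_cents must be non-negative")
        self._items[expense.expense_id] = expense

    def get(self, expense_id: str) -> Expense:
        return self._items[expense_id]


def check_contract(store: ExpenseStore) -> bool:
    """Contract check that runs unchanged against any ExpenseStore implementation."""
    sample = Expense("e-1", 1250, "USD")
    store.save(sample)
    return store.get("e-1") == sample
```

The contract check is the kind of thing that belongs in automated tests rather than in a rule document: it fails mechanically when a regenerated implementation drifts from the agreed interface.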

2. Noteworthy Unique Perspectives and Industry Foresight

Renewed Discussion on the “Causal Large Model” Path: @Pluvio9yte provided a deep analysis of Professor Biwei Huang’s team’s Aether AI. He argues that current LLMs are still at the “data correlation” level (e.g., not knowing that a cup with a hole will leak), and the next generation of AI should evolve towards Causal World Models that understand the mechanisms of the physical world. This is a key factor in advancing AI from probabilistic prediction to logical rigor.
“Perceived Obsolescence” and the Productivity Paradox in the AI Era: Designer @nishuang proposed that Apple’s frequent innovations are a strategy of “perceived obsolescence,” compelling consumers to feel their old devices are outdated. Relatedly, @ruanyf relayed a hot topic from Hacker News: “Since AI boosts efficiency, allowing work to be completed in hours, should we have Fridays off?” He pointed out that without time off or pay raises, AI’s value to employees is questionable, and he poignantly raised the hiring dilemma of “how to interview a programmer whose code is written by AI,” challenging traditional technical hiring standards.
First-Mover Advantage and Gray-Hat Practices in Data Traffic: @gefei55 offered a forward-thinking perspective: waiting for a trend to show up on Google Trends is already lagging. A true growth hacker should use AI to monitor highly-liked posts with links on social media (like X) and preemptively deploy landing pages when a concept is just emerging, before search popularity has formed. He also revealed the specific techniques and vulnerabilities of using scripts to inflate Similarweb traffic rankings to deceive investors (e.g., an extremely low bounce rate ironically becomes evidence of manipulation).
The “Actionable Triangle” of AI-Assisted Education: @lijigang framed AI’s contribution to children’s education into three actionable entry points: Media Transformation (understanding knowledge through multimodality), Difficulty Adaptation (generating problems appropriate for the zone of proximal development), and Constructive Output (turning lessons into a shareable game or webpage to create a positive feedback loop), demonstrating a highly practical and forward-thinking approach.

3. Recommended Tools and Resources

💻 AI Development and Programming Tools

Meta Skill (Meta Skill Builder): @vista8 strongly recommends Meta Skill 2.0, which @yaojingang polished for a month. He claims it is more powerful than the official builder, incorporates techniques from leaked Anthropic source code, and lets non-coders create a Skill of roughly 90-point quality. The GitHub project is now open source.
PPT Automation Skill Chain: @dotey recommends his open-source baoyu-design Skill + baoyu-image-gen Skill combination. This toolset can automatically generate PPTs, videos, or websites with exquisite illustrations locally and can even export the complete layout with images as an editable PPTX file.
Cross-Model Dispatch MCP: @vista8 has open-sourced an MCP that allows Codex to delegate tasks to Claude Code. It also supports multi-turn discussions with more affordable domestic models (like Zhipu and DeepSeek), so that different models’ complementary strengths can be combined across scenarios instead of relying on a single model.
Qiaomu Canvas: @vista8 open-sourced an online canvas tool, like a simplified Photoshop, with seamless integration for image generation via Seedream and GPT-image-2. It supports one-click background removal for images, icons, and emojis, making it ideal for drawing product prototypes (PRDs).

🛠️ Productivity and Growth Tools

YouMind: Endorsed by both @AI_Jasonyu (who depends on it for 90% of his creative work) and @gefei55. The tool, now upgraded to v1.0, has a core advantage in generating long-form content that doesn’t feel AI-written. It also resolves persistent formatting issues on platforms like X and WeChat Official Accounts. It’s currently offering a major first-year promotion.
X/Twitter Viral Post Finder: @gefei55 open-sourced a script that uses the Twitter API to cheaply scan for highly-liked posts containing external links, designed to capture emerging products and buzzwords at their earliest stages.
All-in-One Video Translator: @Pluvio9yte recommended a fully automated video localization tool open-sourced by @xiaohu. It integrates downloading, transcribing, translating, polishing, and hardcoding subtitles, making it ideal for repurposing (or learning from) international videos.

🎨 UI and Aesthetics Guides

getdesign.md: A web resource recommended by @Pluvio9yte. It aggregates complete design specification documents from real brands like Linear, Vercel, and Notion. Feeding these files to an AI in the project root can effectively eliminate the generic “AI aesthetic” from the UI and enhance the quality of the generated code.

📚 Appendix: Today’s Watch List Source Update

Timeframe: Last 3 days; 22 sources covered; 20 updates in total

ArXiv cs.AI (B_intro+search)

Deontic Policies for Runtime Governance of Agentic AI Systems
- Published: 2026-06-20 12:00 Beijing Time
- Abstract: - arXiv:2606.19464v1 Announce Type: new.
  - Abstract: Autonomous agentic AI systems driven by Large Language Models (LLMs) introduce a new class of security, privacy, and compliance challenges: an agent that can call tools, manipulate data, install software, and coordinate with peer agents across organizational boundaries must be constrained not only by authentication and access control but by the full fabric of enterprise governance.
  - This includes specifying what agents are permitted and prohibited from doing, what they are obliged to do after certain actions (e.g., notify the CISO), under what conditions long-standing obligations can be waived, and which rules take precedence when policies conflict.
  - This governance problem exceeds what current policy engines provide.
- EN Highlights:
  - arXiv:2606.19464v1 Announce Type: new
  - Abstract: Autonomous agentic AI systems driven by Large Language Models (LLMs) introduce a new class of security, privacy, and compliance challenges: an agent t…
  - This includes specifying what agents are permitted and prohibited from doing, what they are obliged to do after certain actions (e.g., notify the CISO), under wh…
  - This governance problem exceeds what current policy engines provide
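
To make the ingredients named in this abstract concrete (permissions, prohibitions, obligations triggered by an action, waivers, and precedence between conflicting rules), here is a minimal sketch. It is not the paper's policy language or engine: `Rule`, `Decision`, `evaluate`, the numeric priorities, and the deny-by-default and ties-go-to-prohibition choices are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import AbstractSet, List, Optional


@dataclass(frozen=True)
class Rule:
    modality: str                     # "permit" or "prohibit"
    action: str                       # action name the rule applies to
    priority: int                     # higher number wins when rules conflict
    obligation: Optional[str] = None  # follow-up duty triggered when the action is allowed


@dataclass
class Decision:
    allowed: bool
    obligations: List[str] = field(default_factory=list)


def evaluate(action: str, rules: List[Rule],
             waived: AbstractSet[str] = frozenset()) -> Decision:
    """Decide one action: the highest-priority matching rule wins; no rule means deny."""
    matching = [r for r in rules if r.action == action]
    if not matching:
        return Decision(allowed=False)
    # On equal priority, prohibition wins so that conflicts fail closed.
    winner = max(matching, key=lambda r: (r.priority, r.modality == "prohibit"))
    if winner.modality == "prohibit":
        return Decision(allowed=False)
    duties = [r.obligation for r in matching
              if r.modality == "permit" and r.obligation and r.obligation not in waived]
    return Decision(allowed=True, obligations=duties)


rules = [
    Rule("permit", "install_software", priority=1, obligation="notify_ciso"),
    Rule("permit", "export_customer_data", priority=0),
    Rule("prohibit", "export_customer_data", priority=1),
]

install = evaluate("install_software", rules)
install_waived = evaluate("install_software", rules, waived={"notify_ciso"})
export = evaluate("export_customer_data", rules)
```

Here `install` is allowed and carries the `notify_ciso` obligation, `install_waived` is allowed with no outstanding duty, and `export` is denied because the higher-priority prohibition takes precedence.
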
Measuring Curriculum Alignment across Topical Coverage, Competency, and Cognitive Depth: A Longitudinal Framework Applied to CS2013 and CS2023
- Published: 2026-06-20 12:00 Beijing Time
- Abstract: - arXiv:2606.19469v1 Announce Type: new.
  - Abstract: Undergraduate computer science is governed by international curricular guidelines revised about once a decade, yet programs lack a reliable, reproducible method to measure how fully they cover the current guideline and how that coverage changes as the guideline is reorganized.
  - We address this with a human-in-the-loop pipeline that measures a program’s coverage of an external body of knowledge, applied longitudinally to an accredited Bachelor of Science in Computer Science against the 2013 (CS2013) and 2023 (CS2023) Computer Science curricula.
  - The pipeline represents the program and each guideline as structured corpora, generates candidate course-to-knowledge-unit matches by semantic retrieval, and confirms them through human judgment under an explicit definition of coverage.
- EN Highlights:
  - arXiv:2606.19469v1 Announce Type: new
  - Abstract: Undergraduate computer science is governed by international curricular guidelines revised about once a decade, yet programs lack a reliable, reproduci…
  - We address this with a human-in-the-loop pipeline that measures a program’s coverage of an external body of knowledge, applied longitudinally to one accredited…
  - The pipeline represents the program and each guideline as structured corpora, generates candidate course-to-knowledge-unit matches by semantic retrieval, and co…
Diffusion Language Models: An Experimental Analysis
- Published: 2026-06-20 12:00 Beijing Time
- Abstract: - arXiv:2606.19475v1 Announce Type: new.
  - Abstract: Large Language Models (LLMs) have revolutionized language modeling through autoregressive generation, enabling strong performance across a wide range of tasks.
  - Recently, Diffusion Language Models (DLMs) have emerged as an alternative paradigm that generates text through iterative denoising rather than next-token prediction, allowing for parallel refinement of the entire sequence.
  - While numerous diffusion-based architectures have been proposed, differences in evaluation protocols, datasets, inference budgets, and generation hyperparameters make it difficult to compare their capabilities and understand the trade-offs they offer.
- EN Highlights:
  - arXiv:2606.19475v1 Announce Type: new
  - Abstract: Large Language Models (LLMs) have revolutionized language modeling through autoregressive generation, enabling strong performance across a wide variety…
  - Recently, Diffusion Language Models (DLMs) have emerged as an alternative paradigm that generates text through iterative denoising rather than next-token predic…
  - While numerous diffusion-based architectures have been proposed, differences in evaluation protocols, datasets, inference budgets, and generation hyperparameter…
Hidden Anchors in Multi-Agent LLM Deliberation
- Published: 2026-06-20 12:00 Beijing Time
- Abstract: - arXiv:2606.19494v1 Announce Type: new.
  - Abstract: Multi-agent LLM deliberation, where agents exchange and revise answers over several rounds, is increasingly used to improve reasoning and accuracy, yet how and why it works is seldom modeled.
  - This deliberation mirrors how humans reach decisions.
  - As social animals, we are pulled both by the group—the herd effect captured by classical opinion-dynamics models such as DeGroot and Friedkin-Johnsen—and by our own intrinsic beliefs, which these models do not account for.
- EN Highlights:
  - arXiv:2606.19494v1 Announce Type: new
  - Abstract: Multi-agent LLM deliberation, where agents exchange and revise answers over several rounds, is increasingly used to improve reasoning and accuracy, ye…
  - Such deliberation mirrors how humans reach decisions
  - As social animals we are pulled both by the group, the herd effect that classical opinion-dynamics models such as DeGroot and Friedkin–Johnsen capture, and by…
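
For readers unfamiliar with the two classical models named here, the sketch below implements their standard update rules: DeGroot replaces each opinion with a weighted average of all opinions, and Friedkin-Johnsen blends that average with each agent's initial opinion. The two-agent weights and susceptibility values are made-up toy numbers, and how the paper's own model extends these rules is not shown here.

```python
from typing import Callable, List

Matrix = List[List[float]]


def degroot_step(weights: Matrix, opinions: List[float]) -> List[float]:
    """DeGroot: x(t+1) = W x(t), with W row-stochastic."""
    return [sum(w * x for w, x in zip(row, opinions)) for row in weights]


def friedkin_johnsen_step(weights: Matrix, opinions: List[float],
                          initial: List[float],
                          susceptibility: List[float]) -> List[float]:
    """Friedkin-Johnsen: x_i(t+1) = s_i * (W x(t))_i + (1 - s_i) * x_i(0)."""
    social = degroot_step(weights, opinions)
    return [s * avg + (1.0 - s) * x0
            for s, avg, x0 in zip(susceptibility, social, initial)]


def iterate(step: Callable[[List[float]], List[float]],
            opinions: List[float], rounds: int) -> List[float]:
    """Apply one update rule repeatedly, as in multi-round deliberation."""
    for _ in range(rounds):
        opinions = step(opinions)
    return opinions


W = [[0.5, 0.5], [0.5, 0.5]]  # toy influence weights: both agents listen equally
x0 = [0.0, 1.0]               # toy initial opinions, e.g. confidence in option A

consensus = iterate(lambda x: degroot_step(W, x), x0, rounds=10)
anchored = iterate(lambda x: friedkin_johnsen_step(W, x, x0, [0.5, 0.5]), x0, rounds=50)
```

With these toy numbers DeGroot collapses to full consensus at 0.5, while Friedkin-Johnsen settles at 0.25 and 0.75: the agents move toward each other but stay partly tied to where they started.
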
DeXposure-Claw: An Agentic System for DeFi Risk Supervision
- Published: 2026-06-20 12:00 Beijing Time
- Abstract: - arXiv:2606.19501v1 Announce Type: new.
  - Abstract: Decentralized finance presents regulators with rapidly changing, networked credit risks.
  - General-purpose LLM agents are poorly suited for this environment: they over-read weak evidence and suggest high-risk interventions, while existing evaluations do not provide a regulator-aligned method to measure the resulting false positives.
  - We introduce DeXposure-Claw, a forecast-grounded agentic supervision system that routes LLM decisions through structured evidence: (1) DeXposure-FM, a graph time-series foundational model, forecasts future exposure networks; (2) deterministic monitors and stress scenarios then translate these forecasts into typed alerts, attribution signals, and scenario evidence; (3) data health and trust gates bound escalation before DeXposure-Claw issues auditable regulatory tickets with reasoning.
- EN Highlights:
  - arXiv:2606.19501v1 Announce Type: new
  - Abstract: Decentralized finance exposes supervisors to fast-moving, networked credit risks
  - General-purpose LLM agents fit this setting poorly: they over-read weak evidence and recommend high-stakes interventions, while existing evaluations offer no re…
  - We introduce DeXposure-Claw, a forecast-grounded agentic supervision system that routes LLM decisions through structured evidence: (1) DeXposure-FM, a graph tim…
LLM Doesn’t Know What It Doesn’t Know: Detecting Epistemic Blind Spots via Cross-Model Attribution Divergence on Clinical Tabular Data
- Published: 2026-06-20 12:00 Beijing Time
- Abstract: - arXiv:2606.19509v1 Announce Type: new.
  - Abstract: Large language models (LLMs) are increasingly applied to structured clinical data, yet whether they can recognize the limits of their own knowledge on such tasks remains underexplored.
  - We study this question through the lens of cross-model attribution divergence with the goal of reducing epistemic uncertainty for structured tasks, comparing Qwen 2.5 7B and XGBoost on predictive tasks via attribution divergence analysis.
  - First, LLM verbalized confidence is epistemically hollow, outputting near-constants (0.856-0.937) whether accuracy is 49% or 75.3%, tracking prompt format instead of predictive quality.
- EN Highlights:
  - arXiv:2606.19509v1 Announce Type: new
  - Abstract: Large language models (LLMs) are increasingly applied to structured clinical data, yet whether they can recognize the limits of their own knowledge on…
  - We study this question through the lens of cross-model attribution divergence with the goal of reducing epistemic uncertainty for structured tasks, comparing Qw…
  - We report four findings
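
The gap this abstract describes, stated confidence that stays near-constant while accuracy varies, can be quantified in several ways. One common choice is expected calibration error (ECE), sketched below. This is a standard metric swapped in for illustration, not necessarily what the paper measures, and the sample numbers are made up.

```python
from typing import List, Tuple


def expected_calibration_error(confidences: List[float], correct: List[bool],
                               n_bins: int = 10) -> float:
    """Size-weighted average of |accuracy - mean confidence| over equal-width bins."""
    bins: List[List[Tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[index].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += (len(bucket) / total) * abs(accuracy - mean_conf)
    return ece


# Made-up data: the model always says 0.9 but is right only half the time.
flat_confidence = [0.9] * 10
outcomes = [True, False] * 5
gap = expected_calibration_error(flat_confidence, outcomes)
```

A model whose verbalized confidence tracks prompt format rather than predictive quality would show a large gap like this one (about 0.4 here) on tasks where accuracy is low.
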
REVEAL++: Differentiable Phenotypic Grouping for Vision-Language Retinal Modeling of Alzheimer’s Disease Risk
- Published: 2026-06-20 12:00 Beijing Time
- Abstract: - arXiv:2606.19522v1 Announce Type: new.
  - Abstract: The retina offers a non-invasive window into neurodegenerative diseases, capturing subtle structural patterns associated with future cognitive decline risk.
  - Vision-language alignment frameworks like REVEAL have shown that pairing retinal fundus images with structured clinical risk narratives improves early prediction of Alzheimer’s Disease (AD).
  - A key design choice in these methods is the use of phenotypic grouping, where individuals with similar risk profiles are treated as multi-positive pairs during contrastive learning.
- EN Highlights:
  - arXiv:2606.19522v1 Announce Type: new
  - Abstract: The retina offers a noninvasive window into neurodegenerative disease, capturing subtle structural patterns associated with a risk of future cognitive…
  - Vision-language alignment frameworks such as REVEAL have shown that pairing retinal fundus images with structured clinical risk narratives improves early predic…
  - A key design choice in these approaches is the use of phenotypic grouping, where individuals with similar risk profiles are treated as multi-positive pairs duri…
Emergent Alignment
- Published: 2026-06-20 12:00 Beijing Time
- Abstract: - arXiv:2606.19527v1 Announce Type: new.
  - Abstract: Can Large Language Models (LLMs) discern when their own outputs are misaligned with human ethics?
  - We endow an LLM with a conscience step that reviews its own reasoning and outputs, and we extend the training loss with an alignment component using Direct Preference Optimization (DPO) to steer the model away from non-ethical outputs.
  - The result is an online technique that can adapt the model across a wide range of applications: training, fine-tuning, adversarial prompting, and zero-shot learning.
- EN Highlights:
  - arXiv:2606.19527v1 Announce Type: new
  - Abstract: Can Large Language Models (LLMs) discern when their own outputs are misaligned with human ethics
  - And can they self-correct
  - We endow an LLM with a conscience step that reviews its own reasoning and outputs, and we extend the training loss with an alignment component using Direct Pref…
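
For reference, the standard Direct Preference Optimization loss for a single preference pair is sketched below, computed from sequence log-probabilities under the trained policy and a frozen reference model. How the paper combines this term with its conscience step and the rest of the training loss is not shown here; `dpo_loss` and the sample log-probabilities are illustrative assumptions.

```python
import math


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one (chosen, rejected) pair: -log sigmoid(beta * margin gap)."""
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logits)) equals softplus(-logits); this form avoids overflow.
    return max(-logits, 0.0) + math.log1p(math.exp(-abs(logits)))


# Policy identical to the reference: the loss sits at log(2).
neutral = dpo_loss(-1.0, -1.0, -1.0, -1.0)

# Policy raised the chosen answer and lowered the rejected one: the loss drops.
improved = dpo_loss(-1.0, -3.0, -2.0, -2.0)
```

The loss falls only when the policy shifts probability toward the preferred output relative to the reference model, which is what lets it act as a steering term away from dispreferred outputs.
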
ITNet: A Learnable Integral Transform That Subsumes Convolution, Attention, and Recurrence
- Published: 2026-06-20 12:00 Beijing Time
- Abstract: - arXiv:2606.19538v1 Announce Type: new.
  - Abstract: Convolutional networks, recurrent networks, and transformers each encode different inductive biases—locality, sequential memory, and content-dependent pairwise interactions—and have remained mathematically distinct since their inception.
  - We show that this fragmentation reflects not a fundamental diversity in how signals should be processed, but rather incomplete views of a single underlying mathematical object: a learnable integral transform.
  - We introduce the Integral Transform Network (ITNet), a unified architecture built around a learnable kernel that jointly depends on both position and features.
- EN Highlights:
  - arXiv:2606.19538v1 Announce Type: new
  - Abstract: Convolutional networks, recurrent networks, and transformers each encode different inductive biases – locality, sequential memory, and content-depend…
  - We show that this fragmentation reflects not a fundamental diversity in how signals should be processed, but rather incomplete views of a single underlying math…
  - We introduce the Integral Transform Network (ITNet), a unified architecture built around a learnable kernel that depends jointly on positions and features
Uncertainty Decomposition for Clarification Seeking in LLM Agents
- Published: 2026-06-20 12:00 Beijing Time
- Abstract: - arXiv:2606.19559v1 Announce Type: new.
  - Abstract: Recent position papers argue that the classical aleatoric/epistemic uncertainty framework is insufficient for interactive Large Language Model (LLM) agents, calling for norm-aware, decomposable, and communicable uncertainty representations that can unlock new agent capabilities such as proactively seeking clarification and shared mental model construction.
  - Practical deployment constraints—black-box APIs, interactive latency budgets, and the absence of labeled trajectories—rule out logprob-based, multi-sampling, and training-based approaches, making prompt-based estimation the most viable family for presenting such signals at deployment.
  - We answer this call with a simple prompt-based decomposition that separates action confidence from request uncertainty (u), enabling the agent to ask for clarification when task specifications are ambiguous.
- EN Highlights:
  - arXiv:2606.19559v1 Announce Type: new
  - Abstract: Recent position papers argue that the classical aleatoric/epistemic uncertainty framework is insufficient for interactive large language model (LLM) a…
  - Practical deployment constraints – black-box APIs, interactive latency budgets, and the absence of labeled trajectories – rule out logprob-based, multi-sampli…
  - We answer this call with a simple prompt-based decomposition that separates action confidence from request uncertainty (u), enabling the agent to ask for clarif…
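
A rough sketch of how two separated signals could drive an agent's next move is shown below. It is not the paper's prompt or decision rule: `UncertaintyEstimate`, `next_move`, both thresholds, and the extra `escalate` outcome are hypothetical, and in practice the two scores would come from a prompt to the model rather than being supplied by hand.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UncertaintyEstimate:
    action_confidence: float    # how sure the agent is that its chosen action is right
    request_uncertainty: float  # how ambiguous the request itself is (the abstract's u)


def next_move(estimate: UncertaintyEstimate,
              ask_threshold: float = 0.5,
              act_threshold: float = 0.7) -> str:
    """Return 'clarify', 'act', or 'escalate' from the two separated signals."""
    if estimate.request_uncertainty >= ask_threshold:
        return "clarify"   # the task specification is ambiguous: ask the user
    if estimate.action_confidence >= act_threshold:
        return "act"       # clear request and a confident plan: proceed
    return "escalate"      # clear request but low confidence: report instead of guessing


ambiguous = next_move(UncertaintyEstimate(action_confidence=0.9, request_uncertainty=0.8))
clear = next_move(UncertaintyEstimate(action_confidence=0.9, request_uncertainty=0.1))
unsure = next_move(UncertaintyEstimate(action_confidence=0.3, request_uncertainty=0.1))
```

Keeping the two numbers apart is what allows the first case: an agent can be confident in its plan and still ask, because the ambiguity lies in the request rather than in the action.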

ArXiv cs.CL (B_intro+search)

Exposing the Unsaid: Visualizing Hidden LLM Bias through Stochastic Path Aggregation
- Published: 2026-06-20 12:00 Beijing Time
- Abstract: - arXiv:2606.19344v1 Announce Type: new.
  - Abstract: Large Language Models (LLMs) exhibit representational and syntactic biases that are difficult to evaluate due to the stochastic nature of text generation.
  - Standard auditing methods rely on a single output inspection or static automated metrics.
  - These approaches obscure the underlying probability distributions and fail to capture biases hidden in lower-probability generation branches.
- EN Highlights:
  - arXiv:2606.19344v1 Announce Type: new
  - Abstract: Large Language Models (LLMs) exhibit representational and syntactic biases that are difficult to evaluate due to the stochastic nature of text generat…
  - Standard auditing methods rely on a single output inspection or static automated metrics
  - These approaches obscure the underlying probability distributions and fail to capture biases hidden in lower-probability generation branches
Ensembles of Large Language Models for Identifying EQ-5D Studies in PubMed Based on Their Abstracts
- Published: 2026-06-20 12:00 Beijing Time
- Abstract: - arXiv:2606.19345v1 Announce Type: new.
  - Abstract: The rapid increase in scientific publications has made manual study screening in systematic literature reviews (SLRs) increasingly resource-intensive, inefficient, and inconsistent.
  - Classifying studies that clearly report health-related quality of life outcomes (e.g., EQ-5D data) requires a high level of clinical interpretation, which poses a challenge for human reviewers.
  - This study investigates the use of Google’s Gemini and Gemma large language models (LLMs) to automate EQ-5D detection in the PubMed biomedical database based solely on published abstracts.
- EN Highlights:
  - arXiv:2606.19345v1 Announce Type: new
  - Abstract: The rapid increase in scientific publications leads to the fact that manual study screening in systematic literature reviews (SLRs) is increasingly re…
  - Classifying studies that clearly report health-related quality-of-life results, such as EQ-5D data, requires a high level of clinical interpretation and poses c…
  - This study investigates the use of Google’s Gemini and Gemma large language models (LLMs) in automating EQ-5D detection in the PubMed biomedical database based…
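
The excerpt does not say how the ensemble combines its members, so the following is only a sketch of one common option, majority voting over per-model labels. The vote data, the label names, and the tie-break rule (defaulting to "include" so that borderline abstracts go on to human review) are illustrative assumptions, not details from the paper.

```python
from collections import Counter

def majority_vote(labels, tie_break="include"):
    """Return the most frequent label; fall back to tie_break on a tie."""
    ranked = Counter(labels).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return tie_break
    return ranked[0][0]

# Hypothetical per-abstract votes from several model runs.
votes = {
    "abstract-1": ["include", "include", "exclude"],
    "abstract-2": ["exclude", "exclude", "exclude"],
    "abstract-3": ["include", "exclude"],  # tie, resolved by tie_break
}

decisions = {pmid: majority_vote(v) for pmid, v in votes.items()}
```

Defaulting ties to "include" favours recall, which is usually the safer failure mode in literature screening; a precision-oriented pipeline could pass `tie_break="exclude"` instead.
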
Disentangling Linguistic Relatedness from Task Alignment in Cross-Lingual Transfer
- Publication Time: 2026-06-20 12:00 Beijing Time
- Abstract:
  - arXiv:2606.19346v1 Announce Type: new
  - Abstract: We study cross-lingual transfer by fine-tuning seven large language models (4B–671B parameters) on Arabic and evaluating zero-shot reading comprehension for Semitic and non-Semitic control languages.
  - Across dense and Mixture-of-Experts architectures, we find no evidence of Semitic-specific transfer: models with weak baselines improve dramatically across all languages, while strong baseline models show only marginal gains, regardless of language family.
  - A chain-of-thought ablation reinforces this finding: the same models that benefit most from fine-tuning benefit equally from inference-time reasoning, suggesting that both mechanisms address task format alignment rather than cross-lingual knowledge transfer.
- EN Highlights:
  - arXiv:2606.19346v1 Announce Type: new
  - Abstract: We study cross-lingual transfer by fine-tuning seven large language models (4B–671B parameters) on Arabic and evaluating zero-shot reading comprehens…
  - Across dense and Mixture-of-Experts architectures, we find no evidence of Semitic-specific transfer: models with weak baselines improve dramatically across all…
  - A chain-of-thought ablation reinforces this finding – the same models that benefit most from fine-tuning benefit equally from inference-time reasoning, suggest…
How LLMs Fail and Generalize in RTL Coding for Hardware Design?
- Publication Time: 2026-06-20 12:00 Beijing Time
- Abstract:
  - arXiv:2606.19347v1 Announce Type: new
  - Abstract: Translating sequential programming priors into the parallel temporal logic of hardware design remains a key bottleneck for large language models (LLMs).
  - To investigate this, we introduce a new error taxonomy grounded in problem solvability, inspired by cognitive theory.
  - Our taxonomy categorizes failures into syntactic, semantic, solvable functional, and unsolvable functional types.
- EN Key Points:
  - arXiv:2606.19347v1 Announce Type: new
  - Abstract: Translating sequential programming priors into the parallel temporal logic of hardware design remains a crucial bottleneck for large language models (L…
  - To investigate this, we introduce a new error taxonomy grounded in problem solvability, inspired by cognitive theory
  - Our taxonomy categorizes failures into syntactic, semantic, solvable functional, and unsolvable functional types
DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence
- Publication Time: 2026-06-20 12:00 Beijing Time
- Abstract:
  - arXiv:2606.19348v1 Announce Type: new
  - Abstract: We present a preview version of the DeepSeek-V4 series, including two powerful Mixture-of-Experts (MoE) language models, DeepSeek-V4-Pro with 1.6T parameters (49B active) and DeepSeek-V4-Flash with 284B parameters (13B active), both supporting a context length of 1 million tokens.
  - The DeepSeek-V4 series incorporates several key upgrades in architecture and optimization: (1) a hybrid attention architecture that combines Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA) to improve long-context efficiency; (2) Manifold-Constrained Hyper-Connections (mHC), which enhance conventional residual connections; and (3) the Muon optimizer for faster convergence and greater training stability.
  - We pre-trained both models on over 32T diverse and high-quality tokens, followed by a comprehensive post-training pipeline to unlock and further enhance their capabilities.
- EN Key Points:
  - arXiv:2606.19348v1 Announce Type: new
  - Abstract: We present a preview version of DeepSeek-V4 series, including two strong Mixture-of-Experts (MoE) language models – DeepSeek-V4-Pro with 1.6T paramet…
  - DeepSeek-V4 series incorporate several key upgrades in architecture and optimization: (1) a hybrid attention architecture that combines Compressed Sparse Attent…
  - We pre-train both models on more than 32T diverse and high-quality tokens, followed by a comprehensive post-training pipeline that unlocks and further enhances…
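
As a quick piece of arithmetic on the figures quoted above, the share of parameters that is active in each model can be computed directly. The sketch below uses only the totals and active counts from the abstract; the dictionary layout and the `active_fraction` helper are illustrative names, not anything from the release.

```python
# Figures quoted in the abstract above, in billions of parameters.
MODELS = {
    "DeepSeek-V4-Pro": {"total_b": 1600, "active_b": 49},
    "DeepSeek-V4-Flash": {"total_b": 284, "active_b": 13},
}

def active_fraction(total_b, active_b):
    """Fraction of the model's parameters that is active."""
    return active_b / total_b

fractions = {name: active_fraction(**spec) for name, spec in MODELS.items()}

for name, frac in fractions.items():
    print(f"{name}: {frac:.1%} of parameters active")
```

This works out to roughly 3.1% for the Pro model and 4.6% for the Flash model, which is why the active count, rather than the total, is the more useful number when estimating per-token compute.
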
Where to Place the Query? Unveiling and Mitigating Positional Bias in In-Context Learning for Diffusion LLMs via Decoding Dynamics
- Publication Time: 2026-06-20 12:00 Beijing Time
- Abstract:
  - arXiv:2606.19349v1 Announce Type: new
  - Abstract: While In-Context Learning (ICL) has been widely studied in autoregressive (AR) LLMs, its mechanisms in diffusion Large Language Models (dLLMs) largely remain unexplored.
  - Unlike AR models constrained by unidirectional causal masking, dLLMs intrinsically leverage bidirectional attention, providing extensive spatial flexibility for query placement.
  - Unfortunately, current practices often inherit AR-style trailing query templates, frequently overlooking this shift in structural paradigm.
- EN Key Points:
  - arXiv:2606.19349v1 Announce Type: new
  - Abstract: While In-Context Learning (ICL) is extensively studied in Autoregressive (AR) LLMs, its mechanism within Diffusion Large Language Models (dLLMs) remai…
  - Unlike AR models restricted by unidirectional causal masking, dLLMs intrinsically utilize bidirectional attention, offering extensive spatial flexibility for qu…
  - Unfortunately, current practices conventionally inherit AR-style trailing-query templates, often overlooking the structural paradigm shift
Pruning via Causal Attribution Preserves Reasoning Performance in Large Language Models
- Publication Time: 2026-06-20 12:00 Beijing Time
- Abstract:
  - arXiv:2606.19350v1 Announce Type: new
  - Abstract: Large Language Models (LLMs) excel at multi-step reasoning but incur substantial inference costs.
  - We introduce Causal Attribution Pruning (CAP), a training-free method that identifies critical attention heads by measuring their causal impact on reasoning tasks and uses these head-level scores to guide fine-grained weight pruning.
  - For each attention head, CAP estimates the expected performance degradation when that head is masked during forward passes on a small set of reasoning problems.
- EN Key Points:
  - arXiv:2606.19350v1 Announce Type: new
  - Abstract: Large language models (LLMs) excel at multi-step reasoning but incur substantial inference cost
  - We introduce Causal Attribution Pruning (CAP), a training-free method that identifies critical attention heads by measuring their causal impact on reasoning tas…
  - For each attention head, CAP estimates the expected performance degradation when the head is masked during forward passes on a small calibration set of reasonin…
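
The head-masking step described above can be sketched on a toy model. This is not the CAP implementation: the "model" here is a hypothetical stand-in whose logit is a sum of per-head contributions, accuracy is assumed as the performance measure, and the calibration set is made up. In a real transformer the mask would zero an attention head's output inside the forward pass.

```python
# Toy stand-in for a model with three "heads": the logit is the sum of the
# contributions of the heads that are not masked. Purely illustrative.
HEADS = [
    lambda x: 2.0 * x,   # head 0 carries most of the signal
    lambda x: 0.1 * x,   # head 1 contributes little
    lambda x: -0.5 * x,  # head 2 works against the signal
]

def predict(x, keep_mask):
    """Binary prediction with some heads masked out (keep_mask[i] == False)."""
    logit = sum(head(x) for head, keep in zip(HEADS, keep_mask) if keep)
    return logit > 0

def accuracy(data, keep_mask):
    return sum(predict(x, keep_mask) == y for x, y in data) / len(data)

def head_attribution(data):
    """Score each head by the accuracy lost when that head alone is masked."""
    full = [True] * len(HEADS)
    baseline = accuracy(data, full)
    scores = []
    for i in range(len(HEADS)):
        mask = full.copy()
        mask[i] = False
        scores.append(baseline - accuracy(data, mask))
    return scores

# Hypothetical calibration set: the label is whether the input is positive.
calibration = [(x, x > 0) for x in (-2.0, -1.0, -0.5, 0.5, 1.0, 2.0)]
scores = head_attribution(calibration)

# Heads with the smallest degradation are the first pruning candidates.
prune_order = sorted(range(len(scores)), key=lambda i: scores[i])
```

On this toy, masking head 0 destroys accuracy while masking either of the other two changes nothing, so the ranking puts heads 1 and 2 ahead of head 0 as pruning candidates.
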
Detecting Hallucinations for Large Language Model-based Knowledge Graph Reasoning
- Publication Time: 2026-06-20 12:00 Beijing Time
- Abstract:
  - arXiv:2606.19351v1 Announce Type: new
  - Abstract: Knowledge Graph (KG) reasoning infers new knowledge from existing facts and is widely applied in question answering, recommendation, and decision support.
  - With the rapid development of Large Language Models (LLMs), LLM-based knowledge graph reasoning frameworks have become increasingly popular by leveraging retrieved knowledge graph information.
  - However, hallucinations in LLMs remain a critical issue.
- EN Key Points:
  - arXiv:2606.19351v1 Announce Type: new
  - Abstract: Knowledge graph (KG) reasoning infers new knowledge from existing facts and is widely applied in question answering, recommendation, and decision supp…
  - With the rapid development of large language models (LLMs), LLM-based KG reasoning frameworks have become increasingly popular by leveraging retrieved KG inform…
  - However, hallucinations in LLMs remain a critical issue
Sign-Language Datasets at Scale: A Comprehensive Survey on Resources, Benchmarks, and Annotation Standards
- Publication Time: 2026-06-20 12:00 Beijing Time
- Abstract:
  - arXiv:2606.19352v1 Announce Type: new
  - Abstract: Sign languages are expressive visual languages used by Deaf and Hard-of-Hearing (DHH) communities.
  - Despite substantial progress in sign-language recognition, translation, and production, advances remain constrained by fragmented datasets, inconsistent annotations, and limited linguistic coverage.
  - Existing benchmarks often fail to reflect real-world communication needs, and systematic analyses of these limitations remain limited.
- EN Key Points:
  - arXiv:2606.19352v1 Announce Type: new
  - Abstract: Sign languages are expressive visual languages used by Deaf and Hard-of-Hearing (DHH) communities
  - Despite substantial progress in sign-language recognition, translation, and production, advances remain constrained by fragmented datasets, inconsistent annotat…
  - Existing benchmarks often fail to reflect real-world communication needs, and systematic analyses of these limitations remain limited
Quantifying Aleatoric Uncertainty of In-Context Learning for Robust Measure of LLM Prediction Confidence
- Publication Time: 2026-06-20 12:00 Beijing Time
- Abstract:
  - arXiv:2606.19353v1 Announce Type: new
  - Abstract: In-Context Learning (ICL) allows LLMs to adapt to new tasks with a few demonstrations, but its reliability remains a concern: predictions are highly sensitive to both prompt design and the model’s ability to understand context, blurring whether failures are caused by data properties or model limitations.
  - Uncertainty decomposition (separating aleatoric from epistemic sources) is particularly important in this context, but existing methods designed for standard generation tasks fail to capture the unique dynamics of ICL.
  - To address this, we introduce the concept of eigen-function vectors, which builds on Bayesian perspectives and the mechanistic interpretability of ICL.
- EN Key Points:
  - arXiv:2606.19353v1 Announce Type: new
  - Abstract: In-Context Learning (ICL) allows LLMs to adapt to new tasks from a few demonstrations, but its reliability remains a concern: predictions are highly s…
  - Uncertainty decomposition (separating aleatoric from epistemic sources) is particularly crucial in this setting, yet existing methods, designed for standard gener…
  - To address this, we introduce a concept of self-function vectors, built upon Bayesian views and the mechanistic interpretability of ICL
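
For background on what "separating aleatoric from epistemic sources" means, here is a sketch of one conventional entropy-based decomposition: total uncertainty is the entropy of the averaged prediction, the aleatoric part is the average entropy of the individual predictions, and the epistemic part is the difference. This is general background, not the paper's method, and the example distributions (imagined as predictions under different demonstration sets) are made up.

```python
import math

def entropy(p):
    """Shannon entropy in nats, skipping zero-probability classes."""
    return -sum(q * math.log(q) for q in p if q > 0)

def decompose(dists):
    """Split total predictive uncertainty into aleatoric and epistemic parts.

    dists: list of class-probability lists, one per sampled prediction.
    Returns (total, aleatoric, epistemic) with total = aleatoric + epistemic.
    """
    k = len(dists)
    mean = [sum(col) / k for col in zip(*dists)]
    total = entropy(mean)
    aleatoric = sum(entropy(p) for p in dists) / k
    return total, aleatoric, total - aleatoric

# Every prediction is itself a coin flip: the uncertainty is all aleatoric.
agree = decompose([[0.5, 0.5], [0.5, 0.5]])

# Each prediction is confident but they contradict each other: all epistemic.
disagree = decompose([[1.0, 0.0], [0.0, 1.0]])
```

Both cases have the same total uncertainty (ln 2), yet they call for different responses: the first reflects noise in the task itself, while the second reflects disagreement that better context or a better model could resolve.
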