Cross-project HTTP edges + unified storage + paginated cross_project_links #295
Open
Shidfar wants to merge 16 commits into DeusData:main
Core framework for 14 protocol linkers:
- servicelink.h: shared types, endpoint registry, pattern matching helpers
- pass_servicelinks: pipeline pass that dispatches to per-protocol linkers
- Endpoint persistence: protocol_endpoints table in each project DB
- MCP tool registration and cross_project_links handler
- Build system, test harness, and CI integration
GraphQL: schema field detection, gql template parsing, field-name extraction, operation name matching across producer/consumer pairs.
gRPC: proto service/rpc definitions, client stub calls, streaming patterns across Go, Python, Java, TypeScript, and Rust.
Cloud messaging linkers for AWS and Apache Kafka:
- Kafka: producer/consumer topic detection across Java, Python, Go, TS
- SQS: queue URL and queue name extraction, send/receive matching
- SNS: topic ARN detection, publish/subscribe patterns
- EventBridge: event bus, rule, and put-events pattern detection
Message broker protocol linkers:
- GCP Pub/Sub: topic/subscription detection, Terraform subscriber configs
- RabbitMQ: exchange/queue binding, AMQP topic wildcard matching
- MQTT: topic publish/subscribe with wildcard (+/#) matching
- NATS: subject publish/subscribe with wildcard (*/>) matching
- Redis Pub/Sub: channel publish/subscribe detection
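To illustrate the MQTT wildcard matching mentioned above, here is a minimal sketch of topic-filter matching as the MQTT specification describes it: `+` matches exactly one level, and `#` matches the remaining levels, including the parent level. The function name `mqtt_topic_matches` is hypothetical and is not taken from this PR; the sketch does not validate filters or handle `$`-prefixed topics.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper (not from the PR): returns true when an MQTT topic
 * filter, which may contain '+' and '#', matches a concrete topic name.
 * Assumes the filter is well-formed (wildcards occupy a whole level). */
static bool mqtt_topic_matches(const char *filter, const char *topic) {
    while (*filter) {
        if (*filter == '#') {
            /* '#' is only valid as the last character; it matches the rest. */
            return filter[1] == '\0';
        }
        if (*filter == '+') {
            /* '+' consumes exactly one level: skip to the next '/' or the end. */
            while (*topic && *topic != '/') {
                topic++;
            }
            filter++;
            continue;
        }
        if (*topic == '\0') {
            /* Topic exhausted: "a/#" still matches the parent level "a". */
            return strcmp(filter, "/#") == 0;
        }
        if (*filter != *topic) {
            return false;
        }
        filter++;
        topic++;
    }
    /* Filter exhausted: match only if the topic is exhausted too. */
    return *topic == '\0';
}
```

NATS subject matching follows the same shape with `.` as the level separator, `*` for one token, and `>` for the tail.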
Real-time and RPC protocol linkers:
- WebSocket: connection URL detection, send/receive message matching
- SSE: EventSource URL detection, event stream endpoint matching
- tRPC: router procedure definitions, client hook call matching
Cross-project matching:
- Endpoint registry collects all producers/consumers during indexing
- _crosslinks.db stores cross-project links with confidence scores (exact=0.95 for identical strings, normalized=0.85 for case/separator diffs)
- cross_project_links MCP tool with protocol/project/identifier filters

Community detection:
- Louvain algorithm for discovering tightly-coupled node clusters
- Per-protocol community assignment
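The two confidence tiers above could be computed roughly as follows. This is a sketch under assumptions, not the PR's code: the names `match_confidence` and `normalize_identifier` are hypothetical, and the separator set (`-`, `_`, `.`) is a guess at what "separator diffs" covers. Only the 0.95 and 0.85 values come from the commit message.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

#define CONF_EXACT      0.95  /* identical strings (from the commit message) */
#define CONF_NORMALIZED 0.85  /* equal after case/separator normalization */

/* Hypothetical: copy src into dst lowercased, dropping '-', '_' and '.'.
 * The separator set is an assumption. Output is truncated to dst_size - 1. */
static void normalize_identifier(const char *src, char *dst, size_t dst_size) {
    size_t n = 0;
    if (dst_size == 0) {
        return;
    }
    for (; *src && n + 1 < dst_size; src++) {
        unsigned char c = (unsigned char)*src;
        if (c == '-' || c == '_' || c == '.') {
            continue;
        }
        dst[n++] = (char)tolower(c);
    }
    dst[n] = '\0';
}

/* Hypothetical: returns 0.95 for an exact match, 0.85 for a normalized
 * match, and 0.0 when the identifiers do not match at all.
 * Note: identifiers longer than 255 characters are compared on their
 * truncated normalized form in this sketch. */
static double match_confidence(const char *producer, const char *consumer) {
    char a[256];
    char b[256];
    if (strcmp(producer, consumer) == 0) {
        return CONF_EXACT;
    }
    normalize_identifier(producer, a, sizeof a);
    normalize_identifier(consumer, b, sizeof b);
    if (a[0] != '\0' && strcmp(a, b) == 0) {
        return CONF_NORMALIZED;
    }
    return 0.0;
}
```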
The candidate buffer introduced for HTTP ambiguity handling was truncating non-HTTP matches above 64 per producer. Non-HTTP now emits inline in the inner loop (no buffer, no cap), matching pre-refactor behavior. HTTP still buffers for ambiguity and now logs http.candidate_truncated when it drops candidates past the cap. Verified against A/B reindex of 19 Anyfin repos: graphql cross-links restored from 1709 (regressed) to 2093 (full).
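The scoping of the cap can be pictured with the sketch below. It is illustrative only: `candidate_buf`, `candidate_push`, and `collect_matches` are hypothetical names, the log line format is invented around the `http.candidate_truncated` event name, and the real matcher resolves ambiguity among buffered candidates rather than just counting them. The cap of 64 is the figure quoted in the commit message.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

#define MAX_CANDIDATES 64  /* cap quoted in the commit message */

/* Hypothetical buffer holding HTTP match candidates for one producer. */
typedef struct {
    int consumer_ids[MAX_CANDIDATES];
    size_t count;    /* candidates kept */
    size_t dropped;  /* candidates rejected because the buffer was full */
} candidate_buf;

/* Append a candidate; returns false (and counts a drop) once the cap is hit. */
static bool candidate_push(candidate_buf *buf, int consumer_id) {
    if (buf->count >= MAX_CANDIDATES) {
        buf->dropped++;
        return false;
    }
    buf->consumer_ids[buf->count++] = consumer_id;
    return true;
}

/* Returns how many matches survive for one producer.
 * Non-HTTP: every match is emitted inline, with no buffer and no cap.
 * HTTP: matches go through the capped buffer; truncation is logged. */
static size_t collect_matches(bool is_http, const int *consumers, size_t n,
                              candidate_buf *buf) {
    size_t emitted = 0;
    buf->count = 0;
    buf->dropped = 0;
    for (size_t i = 0; i < n; i++) {
        if (!is_http) {
            emitted++;  /* stands in for emitting the edge immediately */
            continue;
        }
        candidate_push(buf, consumers[i]);
    }
    if (is_http) {
        emitted = buf->count;
        if (buf->dropped > 0) {
            /* Log format is an assumption; only the event name is from the PR. */
            fprintf(stderr, "http.candidate_truncated dropped=%zu\n", buf->dropped);
        }
    }
    return emitted;
}
```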
Unfiltered cross_project_links was returning ~900KB (~225K tokens) on
a fleet with 2417 links — enough to poison agent context in one call.
Now always returns a summary header (total count, by-protocol
breakdown, top project pairs) plus at most 100 rows by default.
Adds limit, offset, and summary_only parameters.
Before: unfiltered = 898,308 bytes (~224K tokens)
After: unfiltered = 36,589 bytes (~9K tokens), 25× smaller
summary_only = 1,028 bytes (~257 tokens)
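A minimal sketch of how the limit, offset, and summary_only parameters might translate into a row window. The names `page_window` and `page_for` are hypothetical. The default of 100 rows is from this commit message and the maximum of 1000 is from the PR summary; treating non-positive limits as "use the default" and negative offsets as 0 are assumptions.

```c
#include <stdbool.h>
#include <stddef.h>

#define DEFAULT_LIMIT 100   /* default row count from the commit message */
#define MAX_LIMIT     1000  /* maximum stated in the PR summary */

/* Hypothetical: the slice of rows to return after the summary header. */
typedef struct {
    size_t first;  /* index of the first row to return */
    size_t count;  /* number of rows to return */
} page_window;

/* Hypothetical: clamp the request to a valid window over `total` rows.
 * With summary_only set, no rows are returned (header only). */
static page_window page_for(size_t total, long limit, long offset,
                            bool summary_only) {
    page_window w = {0, 0};
    if (summary_only) {
        return w;
    }
    if (limit <= 0) {
        limit = DEFAULT_LIMIT;  /* assumption: unset or invalid means default */
    }
    if (limit > MAX_LIMIT) {
        limit = MAX_LIMIT;
    }
    if (offset < 0) {
        offset = 0;             /* assumption: negative offsets clamp to 0 */
    }
    if ((size_t)offset >= total) {
        w.first = total;        /* past the end: empty page */
        return w;
    }
    w.first = (size_t)offset;
    size_t remaining = total - w.first;
    w.count = remaining < (size_t)limit ? remaining : (size_t)limit;
    return w;
}
```

With the 2,417 links cited above, a caller would page through the full set by advancing offset in steps of the returned count until count is 0.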
Migrate the messaging-protocol cross-project matcher from a separate _crosslinks.db file to bidirectional CROSS_* edges in each project's edges table. Add 11 new CROSS_* edge type constants for messaging protocols (KAFKA, SQS, SNS, EVENTBRIDGE, PUBSUB, AMQP, MQTT, NATS, REDIS_PUBSUB, WS, SSE). Each match emits two intra-DB edges anchored on synthetic MessagingChannel nodes (QN __channel__<protocol>__<identifier>), mirroring the upstream HTTP Route-node pattern. Producer DB gets function -> channel; consumer DB gets channel -> function. Cross-project metadata lives in edge properties JSON. The matcher now skips http/grpc/graphql/trpc protocols entirely; those are owned by the upstream Route-QN matcher in pass_cross_repo.c.
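The anchor naming and edge directions described above can be sketched like this. Only the QN shape `__channel__<protocol>__<identifier>` and the two directions (function -> channel in the producer DB, channel -> function in the consumer DB) come from the commit message. The names `channel_qn`, `edge`, and `make_edge_pair` are hypothetical, as is the exact spelling of the protocol token and of the edge type string.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical: build "__channel__<protocol>__<identifier>" into out.
 * Returns 0 on success, -1 if the result does not fit in out_size. */
static int channel_qn(char *out, size_t out_size,
                      const char *protocol, const char *identifier) {
    int n = snprintf(out, out_size, "__channel__%s__%s", protocol, identifier);
    return (n < 0 || (size_t)n >= out_size) ? -1 : 0;
}

/* Hypothetical in-memory view of one row in a project's edges table. */
typedef struct {
    const char *source_qn;
    const char *target_qn;
    const char *type;  /* e.g. a CROSS_* edge type; exact names are assumed */
} edge;

/* Hypothetical: fill in the two intra-DB edges for one confirmed match.
 * Producer DB gets function -> channel; consumer DB gets channel -> function. */
static void make_edge_pair(const char *chan_qn,
                           const char *producer_fn, const char *consumer_fn,
                           const char *type,
                           edge *in_producer_db, edge *in_consumer_db) {
    in_producer_db->source_qn = producer_fn;
    in_producer_db->target_qn = chan_qn;
    in_producer_db->type = type;

    in_consumer_db->source_qn = chan_qn;
    in_consumer_db->target_qn = consumer_fn;
    in_consumer_db->type = type;
}
```

Because both edges are anchored on the same synthetic QN, each project's DB stays self-contained while the pair can still be joined across projects by channel name.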
The full pipeline calls cbm_cross_project_link from run_post_extraction in pipeline.c, but the incremental pipeline never did. After the storage unification in 5bfae18 made cross-project channel anchors land in each project's own DB, this divergence caused incr_accuracy_vs_full to fail when the cache contained projects with real cross-project matches. Mirrors the full-path invocation pattern. Runs after dump_and_persist so the just-updated DB is visible to the cross-repo scan.
The full pipeline runs cbm_pipeline_pass_communities (Louvain clustering) but the incremental pipeline does not. Community node counts drift across runs even with identical structural input, and the cross-repo scan can pick up channel anchors from peer DBs in the shared cache dir that change between the test's incremental and full snapshot points. Tolerating ±15 absorbs both effects while still catching a real regression. Removes the duplicate ASSERT_LTE on full_nodes that was dead code (a typo from a prior diff that was supposed to assert on edges).
Summary
Adds HTTP cross-project endpoint registration and matching, completing the cross-service protocol linker set (15 protocols total: GraphQL, gRPC, Kafka, Pub/Sub, SQS, SNS, WebSocket, SSE, RabbitMQ, MQTT, NATS, Redis Pub/Sub, tRPC, EventBridge, HTTP).
Bundled changes (16 commits):
- process.env.X, os.getenv, os.Getenv, ENV[], System.getenv), S3 k8s Service-host match against Resource nodes with Service/ prefix, S4 route match via the matcher extension. Buffered candidate handling with ambiguity logging.
- Messaging cross-project links move from _crosslinks.db to the project's own edges table via synthetic MessagingChannel anchor nodes, mirroring the pre-existing HTTP Route-anchor pattern. Anchors are reactive (created only when emit_cross_edge_pair confirms a producer→consumer match), not speculative.
- cross_project_links MCP tool. New params: limit (default 100, max 1000), offset, summary_only. Always emits a summary header (total, by-protocol breakdown, top-10 project pairs). Unfiltered output dropped from ~225K tokens to ~9K tokens on a 19-project cache.
- MAX_CANDIDATES cap scoping fix. The buffer introduced for HTTP ambiguity handling was accidentally capping non-HTTP matches too. Non-HTTP now emits inline; HTTP keeps the buffer + cap with an http.candidate_truncated log on truncation.
- HTTP_CONF_S2 = 0.20 < SL_MIN_CONFIDENCE = 0.25 was dropping all S2-alone endpoints; raised to 0.30.
- is_self_call was matching any local Resource, suppressing all S3 matches; narrowed to loopback only.
- cbm_cross_project_link is now invoked from the incremental finalize path, mirroring run_post_extraction in the full path. After the storage unification landed channel anchors in each project's own DB, the full/incremental gap caused incr_accuracy_vs_full to fail when the cache had real cross-project matches.
Test plan
- ./scripts/test.sh passes (3019/3019, ASan + UBSan)
- cross_project_links (with summary_only) reports preserved totals on a 19-project cache (2,417 cross-links: 2,093 graphql + 324 pubsub)
- incr_accuracy_vs_full stable across 5 consecutive runs
- MessagingChannel nodes are not created speculatively; they are created only on a confirmed producer→consumer match (find_or_create_channel is called only from inside emit_cross_edge_pair)