Tobacco Taxation Excise Duties For Manufactured Tobacco Products Updated Rules — 03 September 2025 – 31 October 2025
This dashboard presents an independent analysis of public submissions to the European Commission's Have Your Say consultation portal. The consultation analysed is Tobacco Taxation Excise Duties For Manufactured Tobacco Products Updated Rules, which ran from 03 September 2025 – 31 October 2025.
All data is collected from the EU's public API — no login or API key is required, and no non-public data is used. The original consultation page is available at https://ec.europa.eu/info/law/better-regulation/have-your-say/initiatives/12645-Tobacco-taxation-excise-duties-for-manufactured-tobacco-products-updated-rules-_en.
Submissions are fetched in full via the EU Have Your Say REST API, including the submitter's name, organisation type, country, submission timestamp, and the full text of the feedback. Where a submission includes an attached document (PDF, DOCX, etc.), the text is extracted and treated as part of the submission text for analysis.
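As a rough illustration, a fetch loop along these lines would page through the feedback for one consultation. The endpoint path, query parameters, and response fields shown here are assumptions for illustration only; the consultation's numeric publication ID and the exact response shape are not documented in this section.

```python
import requests

# Sketch only: the endpoint path, parameter names, and response fields below
# are assumptions, not a documented contract of the EU Have Your Say API.
BASE_URL = "https://ec.europa.eu/info/law/better-regulation/api/allFeedback"

def fetch_all_feedback(publication_id: str, page_size: int = 100) -> list[dict]:
    """Page through the public feedback endpoint until no results remain."""
    submissions: list[dict] = []
    page = 0
    while True:
        resp = requests.get(
            BASE_URL,
            params={"publicationId": publication_id, "page": page, "size": page_size},
            timeout=30,
        )
        resp.raise_for_status()
        batch = resp.json().get("_embedded", {}).get("feedback", [])  # assumed shape
        if not batch:
            break
        submissions.extend(batch)
        page += 1
    return submissions
```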
The EU portal itself removes 100% identical submissions. This pipeline goes further: it identifies near-duplicate submissions that are worded slightly differently but carry the same content — for example, template letters from campaign organisations where supporters have made small edits. These are detected using TF-IDF cosine similarity (a standard method that measures how alike two texts are based on their word patterns) with a threshold of 0.95: any two submissions whose texts score ≥ 0.95 are grouped together.
No submissions are removed. Near-duplicate groups are flagged and tracked so they can be accounted for in the analysis, but every submission is counted. Very short submissions (fewer than 10 words) are separated into their own group before similarity scoring.
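A minimal sketch of this grouping step, using scikit-learn's TF-IDF vectorizer and cosine similarity. The union-find helper and variable names are illustrative; the 0.95 threshold and the 10-word cutoff come from the description above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

THRESHOLD = 0.95  # pairs scoring >= 0.95 count as near-duplicates
MIN_WORDS = 10    # very short submissions are set aside before scoring

def group_near_duplicates(texts: list[str]) -> list[set[int]]:
    """Return groups of submission indices; nothing is ever removed."""
    short = {i for i, t in enumerate(texts) if len(t.split()) < MIN_WORDS}
    long_idx = [i for i in range(len(texts)) if i not in short]
    if not long_idx:
        return [short] if short else []

    # TF-IDF vectors capture each text's word-usage pattern; cosine
    # similarity then scores how alike two submissions are.
    matrix = TfidfVectorizer().fit_transform([texts[i] for i in long_idx])
    sims = cosine_similarity(matrix)

    # Union-find: any pair at or above the threshold joins one group.
    parent = list(range(len(long_idx)))
    def find(x: int) -> int:
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for a in range(len(long_idx)):
        for b in range(a + 1, len(long_idx)):
            if sims[a, b] >= THRESHOLD:
                parent[find(a)] = find(b)

    groups: dict[int, set[int]] = {}
    for pos, original in enumerate(long_idx):
        groups.setdefault(find(pos), set()).add(original)
    flagged = [g for g in groups.values() if len(g) > 1]
    if short:
        flagged.append(short)  # short submissions tracked as their own group
    return flagged
```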
Each submission is classified as For, Against, or Unclear relative to the proposal using a three-layer hybrid system:
A microsoft/deberta-v3-large model is fine-tuned (adapted through further training on this specific dataset) using the best available labels from the two layers above, plus any labels added through manual review. After each training run the model is saved and used as the starting point for the next, so accuracy improves as more labelled data accumulates.
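In outline, one such incremental training run could look like the sketch below, built on the Hugging Face transformers Trainer API. The checkpoint path, epoch count, and the dataset argument are placeholders; the section above does not specify them.

```python
import os
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

CHECKPOINT_DIR = "stance_model"  # placeholder path for the rolling checkpoint
NUM_LABELS = 3                   # For, Against, Unclear

# Resume from the previous run's checkpoint when one exists; otherwise
# start from the base microsoft/deberta-v3-large weights.
source = CHECKPOINT_DIR if os.path.isdir(CHECKPOINT_DIR) else "microsoft/deberta-v3-large"
tokenizer = AutoTokenizer.from_pretrained(source)
model = AutoModelForSequenceClassification.from_pretrained(source, num_labels=NUM_LABELS)

def train_once(train_dataset) -> None:
    """Fine-tune on the current best labels, then save for the next run."""
    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir=CHECKPOINT_DIR, num_train_epochs=3),
        train_dataset=train_dataset,  # tokenized texts with stance labels
    )
    trainer.train()
    trainer.save_model(CHECKPOINT_DIR)       # next run starts from here
    tokenizer.save_pretrained(CHECKPOINT_DIR)
```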
Label priority runs from manual review labels (highest) down through the automated layers. The design principle is that unclear is better than wrong: there is no negation handling and no domain overrides. Only manual labels and model consensus can flip a result that would otherwise be ambiguous.
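As a sketch of that principle, a resolver might look like this. Only the fact that manual labels rank highest is stated in this section; the ordering of the automated signals below is an assumption for illustration.

```python
def resolve_stance(manual: str | None = None,
                   consensus: str | None = None,
                   model: str | None = None) -> str:
    """Pick a final label by priority; when in doubt, stay Unclear.

    Assumed ordering for illustration: manual review, then model
    consensus, then a single model's output.
    """
    if manual is not None:
        return manual                # a human label is always final
    if consensus in ("For", "Against"):
        return consensus             # consensus may flip an ambiguous call
    if model in ("For", "Against"):
        return model
    return "Unclear"                 # no confident signal: default to Unclear
```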
Beyond near-duplicate detection, the pipeline identifies coordinated submission campaigns — groups of submissions that are semantically similar and were submitted in close succession, even when the wording has been varied enough to avoid the 0.95 text-similarity threshold. Nothing is removed; this analysis adds context to the data.
Each submission is converted into a numerical representation of its meaning using a multilingual AI model (distiluse-base-multilingual-cased-v2), a technique called sentence embedding. Every submission is then compared against every other to produce a similarity score; two texts can score highly even if the exact words differ, as long as the meaning is similar. Submissions within a configurable time window of each other (default: 5 minutes) and above a similarity threshold are grouped into bursts. Each burst is assigned an ID and a size.
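A condensed sketch of this burst grouping, using the sentence-transformers library. The 0.8 similarity threshold and the helper function are illustrative; only the model name and the 5-minute default window come from the description above.

```python
from datetime import datetime, timedelta

import numpy as np
from scipy.sparse.csgraph import connected_components
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("distiluse-base-multilingual-cased-v2")
WINDOW = timedelta(minutes=5)  # default time window from the description
SIM_THRESHOLD = 0.8            # illustrative value; configurable in practice

def assign_bursts(texts: list[str], timestamps: list[datetime]) -> list[int]:
    """Return a burst ID per submission; -1 means not part of any burst."""
    embeddings = model.encode(texts, convert_to_tensor=True)
    sims = util.cos_sim(embeddings, embeddings).cpu().numpy()

    # Two submissions are linked if they are close in time AND in meaning.
    n = len(texts)
    linked = np.zeros((n, n), dtype=bool)
    for i in range(n):
        for j in range(i + 1, n):
            in_window = abs(timestamps[i] - timestamps[j]) <= WINDOW
            linked[i, j] = linked[j, i] = in_window and sims[i, j] >= SIM_THRESHOLD

    # Transitively linked submissions form one burst.
    _, labels = connected_components(linked, directed=False)
    sizes = np.bincount(labels)
    return [int(labels[i]) if sizes[labels[i]] >= 2 else -1 for i in range(n)]
```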
Results are shown on the dashboard as a timeline of burst activity over the consultation period. Burst membership is carried forward into the stance analysis so submissions can be identified as part of a coordinated campaign.
A review tool allows verified reviewers to label individual submissions by hand. Manual labels take the highest priority in the system and feed directly into the next supervised training run. Reviewers focus on difficult cases — submissions the model is uncertain about, underrepresented submitter types, and any entries flagged for closer attention.
Multiple reviewers can label the same submission independently. Where reviewers agree, confidence in the label is higher. Where they disagree, the submission is automatically pushed to the top of the queue for other reviewers so that additional votes can break the deadlock — no single person overrides the others. The final label is determined by majority vote across all reviewers.
Individual reviewer labels are never altered — each person's vote is recorded exactly as cast. When labels are imported into the pipeline, any tie involving opposing stances is resolved to Unclear: a For/Against split picks neither side, and a For/Unclear or Against/Unclear split defaults to Unclear as the safer option — leaving the final call to the model rather than forcing a contested human label. A submission is considered settled once the leading label holds 75% or more of all votes cast on it.
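A small sketch of that vote resolution (the function shape is illustrative; the tie rules and the 75% settled threshold come from the description above):

```python
from collections import Counter

SETTLED_SHARE = 0.75  # leading label must hold >= 75% of all votes cast

def resolve_votes(votes: list[str]) -> tuple[str, bool]:
    """Combine independent reviewer votes into (final_label, settled)."""
    if not votes:
        return "Unclear", False

    counts = Counter(votes)                   # each vote recorded as cast
    top_label, top_n = counts.most_common(1)[0]
    tied = [label for label, n in counts.items() if n == top_n]

    # Any tie is resolved to Unclear: For/Against picks neither side,
    # and For/Unclear or Against/Unclear defaults to the safer option.
    final = "Unclear" if len(tied) > 1 else top_label

    settled = top_n / len(votes) >= SETTLED_SHARE
    return final, settled
```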
The full pipeline, covering data collection, deduplication, stance detection, and this dashboard, is open source and available on GitHub: EU Have Your Say — Consultation Analysis Toolkit. Issues and contributions are welcome.