Tobacco Taxation Excise Duties For Manufactured Tobacco Products Updated Rules — 03 September 2025 – 31 October 2025
This dashboard presents an independent analysis of public submissions to the European Commission's Have Your Say consultation portal. The consultation analysed is Tobacco Taxation Excise Duties For Manufactured Tobacco Products Updated Rules, which ran from 03 September 2025 – 31 October 2025.
All data is collected from the EU's public API — no login or API key is required, and no non-public data is used. The original consultation page is available at https://ec.europa.eu/info/law/better-regulation/have-your-say/initiatives/12645-Tobacco-taxation-excise-duties-for-manufactured-tobacco-products-updated-rules-_en.
Submissions are fetched in full via the EU Have Your Say REST API, including the submitter's name, organisation type, country, submission timestamp, and the full text of the feedback. Where a submission includes an attached document (PDF, DOCX, etc.), the text is extracted and treated as part of the submission text for analysis.
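As a rough illustration, a fetch loop along these lines would page through the feedback for one consultation. The endpoint path, query parameters, and response fields shown here are assumptions for illustration only; the consultation's numeric publication ID and the exact response shape are not documented in this section.

```python
import requests

# Sketch only: the endpoint path, parameter names, and response fields below
# are assumptions, not a documented contract of the EU Have Your Say API.
BASE_URL = "https://ec.europa.eu/info/law/better-regulation/api/allFeedback"

def fetch_all_feedback(publication_id: str, page_size: int = 100) -> list[dict]:
    """Page through the public feedback endpoint until no results remain."""
    submissions: list[dict] = []
    page = 0
    while True:
        resp = requests.get(
            BASE_URL,
            params={"publicationId": publication_id, "page": page, "size": page_size},
            timeout=30,
        )
        resp.raise_for_status()
        batch = resp.json().get("_embedded", {}).get("feedback", [])  # assumed shape
        if not batch:
            break
        submissions.extend(batch)
        page += 1
    return submissions
```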
The EU portal itself removes 100% identical submissions. This pipeline goes further: it identifies near-duplicate submissions that are worded slightly differently but carry the same content — for example, template letters from campaign organisations where supporters have made small edits. These are detected using TF-IDF cosine similarity (a standard method that measures how alike two texts are based on their word patterns) with a threshold of 0.95: any two submissions whose texts score ≥ 0.95 are grouped together.
No submissions are removed. Near-duplicate groups are flagged and tracked so they can be accounted for in the analysis, but every submission is counted. Very short submissions (fewer than 10 words) are separated into their own group before similarity scoring.
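A minimal sketch of this grouping step, using scikit-learn's TF-IDF vectorizer and cosine similarity. The union-find helper and variable names are illustrative; the 0.95 threshold and the 10-word cutoff come from the description above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

THRESHOLD = 0.95  # pairs scoring >= 0.95 count as near-duplicates
MIN_WORDS = 10    # very short submissions are set aside before scoring

def group_near_duplicates(texts: list[str]) -> list[set[int]]:
    """Return groups of submission indices; nothing is ever removed."""
    short = {i for i, t in enumerate(texts) if len(t.split()) < MIN_WORDS}
    long_idx = [i for i in range(len(texts)) if i not in short]
    if not long_idx:
        return [short] if short else []

    # TF-IDF vectors capture each text's word-usage pattern; cosine
    # similarity then scores how alike two submissions are.
    matrix = TfidfVectorizer().fit_transform([texts[i] for i in long_idx])
    sims = cosine_similarity(matrix)

    # Union-find: any pair at or above the threshold joins one group.
    parent = list(range(len(long_idx)))
    def find(x: int) -> int:
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for a in range(len(long_idx)):
        for b in range(a + 1, len(long_idx)):
            if sims[a, b] >= THRESHOLD:
                parent[find(a)] = find(b)

    groups: dict[int, set[int]] = {}
    for pos, original in enumerate(long_idx):
        groups.setdefault(find(pos), set()).add(original)
    flagged = [g for g in groups.values() if len(g) > 1]
    if short:
        flagged.append(short)  # short submissions tracked as their own group
    return flagged
```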
Each submission is classified as For, Against, or Unclear relative to the proposal using a three-layer hybrid system:
A microsoft/deberta-v3-large model is fine-tuned (adapted through further training on this specific dataset) using the best available labels from the two layers above, plus any labels added through manual review. After each training run the model is saved and used as the starting point for the next, so accuracy improves as more labelled data accumulates.
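In outline, one such incremental training run could look like the sketch below, built on the Hugging Face transformers Trainer API. The checkpoint path, epoch count, and the dataset argument are placeholders; the section above does not specify them.

```python
import os
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

CHECKPOINT_DIR = "stance_model"  # placeholder path for the rolling checkpoint
NUM_LABELS = 3                   # For, Against, Unclear

# Resume from the previous run's checkpoint when one exists; otherwise
# start from the base microsoft/deberta-v3-large weights.
source = CHECKPOINT_DIR if os.path.isdir(CHECKPOINT_DIR) else "microsoft/deberta-v3-large"
tokenizer = AutoTokenizer.from_pretrained(source)
model = AutoModelForSequenceClassification.from_pretrained(source, num_labels=NUM_LABELS)

def train_once(train_dataset) -> None:
    """Fine-tune on the current best labels, then save for the next run."""
    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir=CHECKPOINT_DIR, num_train_epochs=3),
        train_dataset=train_dataset,  # tokenized texts with stance labels
    )
    trainer.train()
    trainer.save_model(CHECKPOINT_DIR)       # next run starts from here
    tokenizer.save_pretrained(CHECKPOINT_DIR)
```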
Label priority runs from manual review labels (highest) down through the automated layers. The design principle is that unclear is better than wrong: there is no negation handling and no domain overrides. Only manual labels and model consensus can flip a result that would otherwise be ambiguous.
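As a sketch of that principle, a resolver might look like this. Only the fact that manual labels rank highest is stated in this section; the ordering of the automated signals below is an assumption for illustration.

```python
def resolve_stance(manual: str | None = None,
                   consensus: str | None = None,
                   model: str | None = None) -> str:
    """Pick a final label by priority; when in doubt, stay Unclear.

    Assumed ordering for illustration: manual review, then model
    consensus, then a single model's output.
    """
    if manual is not None:
        return manual                # a human label is always final
    if consensus in ("For", "Against"):
        return consensus             # consensus may flip an ambiguous call
    if model in ("For", "Against"):
        return model
    return "Unclear"                 # no confident signal: default to Unclear
```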
Beyond near-duplicate detection, the pipeline identifies coordinated submission campaigns — groups of submissions that are semantically similar and were submitted in close succession, even when the wording has been varied enough to avoid the 0.95 text-similarity threshold. Nothing is removed; this analysis adds context to the data.
Each submission is converted into a numerical representation of its meaning using a multilingual AI model (distiluse-base-multilingual-cased-v2), a technique called sentence embedding. Every submission is then compared against every other to produce a similarity score; two texts can score highly even if the exact words differ, as long as the meaning is similar. Submissions within a configurable time window of each other (default: 5 minutes) and above a similarity threshold are grouped into bursts. Each burst is assigned an ID and a size.
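A condensed sketch of this burst grouping, using the sentence-transformers library. The 0.8 similarity threshold and the helper function are illustrative; only the model name and the 5-minute default window come from the description above.

```python
from datetime import datetime, timedelta

import numpy as np
from scipy.sparse.csgraph import connected_components
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("distiluse-base-multilingual-cased-v2")
WINDOW = timedelta(minutes=5)  # default time window from the description
SIM_THRESHOLD = 0.8            # illustrative value; configurable in practice

def assign_bursts(texts: list[str], timestamps: list[datetime]) -> list[int]:
    """Return a burst ID per submission; -1 means not part of any burst."""
    embeddings = model.encode(texts, convert_to_tensor=True)
    sims = util.cos_sim(embeddings, embeddings).cpu().numpy()

    # Two submissions are linked if they are close in time AND in meaning.
    n = len(texts)
    linked = np.zeros((n, n), dtype=bool)
    for i in range(n):
        for j in range(i + 1, n):
            in_window = abs(timestamps[i] - timestamps[j]) <= WINDOW
            linked[i, j] = linked[j, i] = in_window and sims[i, j] >= SIM_THRESHOLD

    # Transitively linked submissions form one burst.
    _, labels = connected_components(linked, directed=False)
    sizes = np.bincount(labels)
    return [int(labels[i]) if sizes[labels[i]] >= 2 else -1 for i in range(n)]
```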
Results are shown on the dashboard as a timeline of burst activity over the consultation period. Burst membership is carried forward into the stance analysis so submissions can be identified as part of a coordinated campaign.
A review tool allows verified reviewers to label individual submissions by hand. Manual labels take the highest priority in the system and feed directly into the next supervised training run. Reviewers focus on difficult cases — submissions the model is uncertain about, underrepresented submitter types, and any entries flagged for closer attention.
Multiple reviewers can label the same submission independently. Where reviewers agree, confidence in the label is higher. Where they disagree, the submission is automatically pushed to the top of the queue for other reviewers so that additional votes can break the deadlock — no single person overrides the others. The final label is determined by majority vote across all reviewers.
Individual reviewer labels are never altered — each person's vote is recorded exactly as cast. When labels are imported into the pipeline, any tie involving opposing stances is resolved to Unclear: a For/Against split picks neither side, and a For/Unclear or Against/Unclear split defaults to Unclear as the safer option — leaving the final call to the model rather than forcing a contested human label. A submission is considered settled once the leading label holds 75% or more of all votes cast on it.
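A small sketch of that vote resolution (the function shape is illustrative; the tie rules and the 75% settled threshold come from the description above):

```python
from collections import Counter

SETTLED_SHARE = 0.75  # leading label must hold >= 75% of all votes cast

def resolve_votes(votes: list[str]) -> tuple[str, bool]:
    """Combine independent reviewer votes into (final_label, settled)."""
    if not votes:
        return "Unclear", False

    counts = Counter(votes)                   # each vote recorded as cast
    top_label, top_n = counts.most_common(1)[0]
    tied = [label for label, n in counts.items() if n == top_n]

    # Any tie is resolved to Unclear: For/Against picks neither side,
    # and For/Unclear or Against/Unclear defaults to the safer option.
    final = "Unclear" if len(tied) > 1 else top_label

    settled = top_n / len(votes) >= SETTLED_SHARE
    return final, settled
```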
The full pipeline, covering data collection, deduplication, stance detection, and this dashboard, is open source and available on GitHub: EU Have Your Say — Consultation Analysis Toolkit. Issues and contributions are welcome.